Per-query energy consumption of LLMs
16 points by avsm
To put this in perspective for LLMs at scale: the numbers for Google Gemini are 0.24 Wh/prompt and 0.26 ml of water/prompt (source).
It's also good to keep in mind that the actual impact is not just the grand total of energy expended on querying or training, and the pollution that causes, but also how the technology is used.
Numbers like these have a tendency to downplay pollution and can be considered ‘selective disclosures’, as discussed in a recent Guardian article on AI's impact on pollution (quoting various reputable researchers). It paints a rather grim picture, e.g. the tech being used to expand oil extraction.
“What I’m worried about is that we’re deploying AI in such a way that we don’t have a good idea of the energy use,” said Sasha Luccioni, climate lead at AI company Hugging Face, who has grown frustrated by “selective disclosures” from big companies that obscure the real climate impact of their products. – direct link to source text
For folks who denominate in USA's dollars: at the current cost of electricity in The Dalles, $0.0599/kWh, that 0.24 Wh comes to approximately 0.0014376¢, and at the current cost of (lots of) water in The Dalles, $1.87/kgal (1,000 US gallons, i.e. roughly 3,785,000 ml), that 0.26 ml comes to approximately 0.0000128¢. I'm going to round and say that an individual Gemini query costs Google about 0.0015¢ in electricity and water, well under one-hundredth of one penny, with the electricity dominating. For what it's worth, that's roughly in line with the ~0.3 Wh Google used to cite for an individual search query; it would seem that Gemini is roughly as expensive as Search on a per-query basis, which might well have been a design goal.
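For anyone who wants to check that working, here's the arithmetic spelled out as a tiny Python snippet. The only inputs are Google's published per-prompt figures and the two The Dalles rates above; the gallon-to-millilitre conversion is the standard constant.

    # Back-of-envelope cost of one Gemini prompt at The Dalles utility rates.
    ENERGY_WH_PER_PROMPT = 0.24        # Google's reported figure
    WATER_ML_PER_PROMPT = 0.26         # Google's reported figure
    ELECTRICITY_USD_PER_KWH = 0.0599   # The Dalles electricity rate quoted above
    WATER_USD_PER_KGAL = 1.87          # The Dalles water rate per 1,000 US gallons
    ML_PER_KGAL = 1000 * 3785.411784   # 1 US gallon = 3785.411784 ml

    electricity_cents = ENERGY_WH_PER_PROMPT / 1000 * ELECTRICITY_USD_PER_KWH * 100
    water_cents = WATER_ML_PER_PROMPT / ML_PER_KGAL * WATER_USD_PER_KGAL * 100

    print(f"electricity: {electricity_cents:.7f} cents")   # ~0.0014376
    print(f"water:       {water_cents:.7f} cents")         # ~0.0000128
    print(f"total:       {electricity_cents + water_cents:.7f} cents")  # ~0.0014505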
I did reference that work in the intro. I'm glad Google have published something, but it's a little hard to interpret what it might mean for your own individual usage: the length of the median prompt they use is not disclosed (does it look like the prompts I generate or not?), and they don't say which model is used in their estimate for that median prompt (is it one I would choose to use? Ideally you'd have an idea of its size as well, of course).
Though mentioning that reported Wh/prompt somewhere in the results section might be a good idea - I may do that when I next tweak the post.
It's very hard to say these days how prompt length influences this, given how these systems work. For instance, in agentic workloads, caching can have a bigger impact on latency than prompt length itself. I'm assuming this also shows up in the associated energy usage.
I totally agree that what you'd expect in an agentic workload - a much larger context and a heavy reliance on caching of input tokens across many calls - isn't really represented by the InferenceMAX benchmark. As I allude to at the end of the post, the move towards linear attention mechanisms at least simplifies things a little, as you no longer need to worry that the Nth token is significantly more expensive to generate than (for instance) the (N/2)th token. But you'd still be left wanting to add in some overhead for the cost of retrieving and writing to the cache, and then figure out what impact the higher average context size has on the achievable concurrency.
For Google's figures, I suppose I should have said that the prompt length, in terms of the number of cached and uncached input tokens, along with the median output sequence length, would be helpful to know as a yardstick - to at least consider whether it looks anything like the kind of workload you're interested in.
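To make that concrete, here's the kind of toy yardstick I have in mind: split a prompt's energy into uncached prefill, cached prefill, and decode components. The per-token figures below are placeholders made up purely for illustration - you'd have to fit them from measurements like the InferenceMAX runs rather than take them from anywhere.

    # Hypothetical yardstick, not a measurement: decompose per-prompt energy.
    # All three per-token figures are made-up placeholders to be fitted
    # against real benchmark data.
    def estimate_wh_per_prompt(uncached_in, cached_in, out_tokens,
                               wh_per_uncached_in=1e-4,  # placeholder
                               wh_per_cached_in=1e-5,    # placeholder: cache hits are cheaper
                               wh_per_out=1e-3):         # placeholder: decode dominates
        return (uncached_in * wh_per_uncached_in
                + cached_in * wh_per_cached_in
                + out_tokens * wh_per_out)

    # A chat-style prompt vs. an agentic call with a large, mostly cached context:
    print(estimate_wh_per_prompt(uncached_in=500, cached_in=0, out_tokens=400))
    print(estimate_wh_per_prompt(uncached_in=800, cached_in=20000, out_tokens=150))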
Great work. I couldn't find a reference to this in the post, but do you know if InferenceMAX uses batching? Processing multiple prompts in parallel is more energy-efficient (at least on consumer GPUs).
Thanks! It ended up being much more of a rabbit hole than I expected.
To answer your question: yes, very much so - there's no way to get these kinds of throughput results without batching. As a fun example, here you can see the AMD config squeezing out all it can by specializing for every feasible batch size rather than relying on the default dynamic batch size support. https://github.com/InferenceMAX/InferenceMAX/blob/a075f2e7c9f3c2dd7e3411067ef5933ac8f44810/benchmarks/gptoss_fp4_mi355x_docker.sh (see https://blog.vllm.ai/2025/08/20/torch-compile.html#dynamic-batch-sizes-and-specialization for background).
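If you want to see the batching effect on your own hardware, here's a minimal sketch of one way to measure it: integrate GPU power draw around a run at a given batch size and divide by the number of prompts served. This assumes an NVIDIA GPU with the nvidia-ml-py (pynvml) bindings, and run_batch is a stand-in for whatever inference call you're benchmarking - it's not how InferenceMAX itself collects its numbers.

    import time, threading, pynvml

    def measure_energy_wh(workload, poll_s=0.05):
        # Integrate GPU 0's power draw while workload() runs; returns Wh.
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        samples = [pynvml.nvmlDeviceGetPowerUsage(handle)]  # milliwatts
        done = threading.Event()

        def poll():
            while not done.is_set():
                samples.append(pynvml.nvmlDeviceGetPowerUsage(handle))
                time.sleep(poll_s)

        t = threading.Thread(target=poll)
        t.start()
        start = time.time()
        workload()
        elapsed = time.time() - start
        done.set()
        t.join()
        pynvml.nvmlShutdown()
        avg_watts = (sum(samples) / len(samples)) / 1000.0
        return avg_watts * elapsed / 3600.0

    # e.g. wh = measure_energy_wh(lambda: run_batch(prompts))  # run_batch is hypothetical
    #      print(wh / len(prompts), "Wh per prompt at this batch size")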
It would be nice if companies were required to make energy consumption per "usage field" public.
Also simply from a data perspective: currently a lot of it seems to be educated guesses, which might hide interesting potential.
As a side effect, it would force the companies to develop a way to track all this sector by sector… (My guess is that the companies don't just refuse to disclose things, but also track usage in a way that isn't aligned with a human understanding of the types of use.)