Animats 2 days ago

A good question is whether you can grind a model specialized for, say, customer service for your products down to where it's really cheap to run on an ordinary server, maybe with a GPU card.

Are we really going to need all those giant AI data centers?

vasco 2 days ago | parent | next [-]

Our brain works on a couple of bananas, so at least the amount of energy required for inference alone doesn't look like it needs to be much. Training is another subject, because we have that embedded in DNA and cultural behavior, so it's trickier.

xwolfi 2 days ago | parent | next [-]

Well, yeah, but you have to look at the entire training duration for your brain. It did take a while to get as perfect as you seem to be, several billion years, and I'm sure you still make mistakes sometimes and hallucinate stupid ideas.

And you don't run for long on a couple of bananas: the brain is not just there to infer, it also has to manage its autonomous transport system, which itself requires much more energy.

seer 2 days ago | parent | prev | next [-]

Well, in this analogy “training” is the thousands of cycles of sleep, moving and rearranging the brain-cell connections, that happen at night. That is _a lot_ of bananas, though obviously not all of the energy of growing up goes to brain rearranging.

Still, it shouldn’t be more than a few buckets of fat, if you only count the NREM “training” bit of sleep.

stingraycharles 2 days ago | parent [-]

No, that’s reinforcement learning and small incremental model updates. The real initial training & model deployment is more akin to DNA. Models cannot “learn” the same way humans do.

TheFuzzball 2 days ago | parent | prev | next [-]

> Our brain works on a couple of bananas

What a fantastic non sequitur

pama 2 days ago | parent | prev [-]

Silicon is already more efficient for inference than the brain. If we use centralized decoding of the V3/R1-scale models as a baseline, one can produce 720,000 tokens (a wild guess for the tokens humans could produce in 24 hours) using the energy of only 0.36 bananas. Deeply thinking humans expend up to a third of their total energy on the brain, but cannot sustain themselves on a single banana per day.

(You can use an LLM to check this work at the cost of a tiny speck of a banana, eg: https://grok.com/share/c2hhcmQtMw%3D%3D_60f4890d-711b-4331-9... )

Vetch 2 days ago | parent | next [-]

The brain is certainly vastly more energy efficient at inference than LLMs on GPUs. But it looks like you're trying to make a different argument, that an LLM can spend less energy than a human to complete a given task. Unfortunately, you have not made that argument and I won't be reading unverified LLM output that might contain hallucinated steps or claims.

> V3/R1 scale models as a baseline, one can produce 720,000 tokens

On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM, but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like-for-like comparison. To be complete, the cost required to match human-level quality, if that is even possible, also needs accounting for.

> Deeply thinking humans expend up to a a third of their total energy on the brain

Where did you get this from? A 70B LLM? It's wrong, or at best does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy is spent on things like upkeep and maintaining resting membrane potential. Ongoing "background activity" like the DMN also means the brain is always actively computing something interesting.

pama a day ago | parent | next [-]

> > V3/R1 scale models as a baseline, one can produce 720,000 tokens
>
> On what hardware? At how many tokens per second? But most importantly, at what quality?

The hardware is the GB200 NVL72 by Nvidia. This is for the class of 671B DeepSeek models, e.g. R1-0528 or V3, with their full-accuracy setup (i.e. reproducing the quality of the reported DeepSeek benchmarks). Here is the writeup (by humans; the second figure shows the tokens per second per GPU as a function of batch size, which emphasizes the advantages of centralized decoding compared to current hacks at home): https://lmsys.org/blog/2025-06-16-gb200-part-1/

And here are the instructions to replicate the particular benchmark: https://github.com/sgl-project/sglang/issues/7227

The LLM text I linked in my original answer carries out the math using the energy consumption of the Nvidia hardware setup (120 kW) and rather simple arithmetic, which you can reproduce.
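
For anyone who would rather not open an LLM transcript, here is a minimal back-of-envelope sketch of that arithmetic in Python. The 120 kW rack draw is from this thread; the per-GPU decode throughput (~7,500 tokens/s, roughly the figure in the lmsys writeup) and the ~105 kcal banana are illustrative assumptions.

  # Back-of-envelope check of the tokens-per-banana claim; assumed figures are marked.
  KW_PER_RACK = 120                # GB200 NVL72 peak draw (from the comment above)
  TOKENS_PER_SEC_PER_GPU = 7_500   # assumption: rough decode throughput per GPU
  GPUS = 72
  KCAL_PER_BANANA = 105            # assumption: typical medium banana
  KWH_PER_KCAL = 4184 / 3.6e6      # 1 kcal = 4184 J, 1 kWh = 3.6e6 J

  tokens = 720_000                 # the "human day" token budget used in the thread
  seconds = tokens / (TOKENS_PER_SEC_PER_GPU * GPUS)   # ~1.3 s of rack time
  kwh = KW_PER_RACK * seconds / 3600                   # energy spent on those tokens
  bananas = kwh / (KCAL_PER_BANANA * KWH_PER_KCAL)
  print(f"{seconds:.2f} s, {kwh * 1000:.1f} Wh, {bananas:.2f} bananas")

On those assumptions, 720,000 tokens take about 1.3 seconds of rack time and roughly a third of a banana, which is where a figure like 0.36 comes from.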

ben_w 2 days ago | parent | prev [-]

I agree with you that quality is the most important question, for similar reasons.

I don't think current models are at expert level, but they do seem to be reliably good enough to be useful, to pass standardised tests, and to sit solidly in the "good enough that you have to pay close attention for a while before you notice the stupid mistake" territory that makes them very irritating for anyone running job interviews or publishing books, etc.

And worse, I also think the numbers you're replying to are, at best, off by a few decimal places.

If I take the 0.36 bananas (which was already suspicious) and USD 0.1/kWh, I get about 0.004 USD. If I scale that up by 1/0.72 to get a megatoken, that's still only 5/9ths of a cent.

If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap. It's close enough to the cost of their text-embedding-3-small (per megatoken) to be within the fudge factor of my assumption about how much of their prices is electricity, but embedding models are much, much weaker than generative models, to the point that they're not worth considering in the same discussion unless you're making an academic point.
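
To make the conversion explicit, here is the same electricity arithmetic as a short sketch; the ~105 kcal banana and the USD 0.10/kWh rate are the assumptions:

  # Electricity cost implied by "0.36 bananas per 720k tokens"; conversions are assumptions.
  KCAL_PER_BANANA = 105            # assumption: typical medium banana
  KWH_PER_KCAL = 4184 / 3.6e6      # 1 kcal = 4184 J, 1 kWh = 3.6e6 J
  USD_PER_KWH = 0.10               # the electricity price used above

  kwh = 0.36 * KCAL_PER_BANANA * KWH_PER_KCAL          # ~0.044 kWh
  usd_per_720k_tokens = kwh * USD_PER_KWH              # ~$0.004
  usd_per_megatoken = usd_per_720k_tokens / 0.72       # roughly half a cent
  print(f"${usd_per_720k_tokens:.4f} per 720k tokens, ${usd_per_megatoken:.4f} per megatoken")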

> It's wrong or at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%).

Indeed.

Now I'm wondering: how much power does the human brain use during an epileptic fit? That seems like it could plausibly be 70% of calories for the few seconds of the seizure? But I've only got a GCSE grade C in biology, so even with what I've picked up in the subsequent 25 years of general geeking, my idea of "plausible" is very weak.

pama a day ago | parent [-]

> If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap

This assumption is very wrong. The primary cost factor in inference is the GPU itself. Nvidia's profit margins are very high; so are OpenAI's margins on API usage, even after taking the cost of the GPUs into account. You can understand their margins if you read about inference at scale, and the lmsys blog in my parallel answer is a decent eye-opener if you thought that companies sell tokens close to the price of electricity.

pama 20 hours ago | parent [-]

An alternative and perhaps easier way to estimate the relative importance of GPU cost vs. electricity cost is to estimate how many years of constant use of the GPU at full power it takes for the cost of industrial-scale electricity to catch up to the cost of the industrially priced GPU. The H200 has a 700 W max power draw and costs about 40k USD (prices vary a lot); the typical lowest rental price a year ago was 2 USD/h, possibly a bit lower by now. In one hour you could not even spend 1 kWh of electricity with one of them under optimal compute conditions, yet, at scale, you can negotiate prices lower than 0.05 USD per kWh of electricity in some parts of the US. Alternatively, assume 0.05 USD per kWh and use the GB200 NVL72, which draws 120 kW at peak. That is a cost of 6 USD/hour, or $52.6k per year. Even if one were to use the hardware for 10 years straight without problems at peak performance, the cost of electricity is way lower than the cost of the hardware itself (you have to ask Nvidia for a quote, but expect a multi-million dollar tag, and they have no shortage of customers ready to pay).
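
As a sketch, using the $0.05/kWh rate and 120 kW draw from above and a purely illustrative multi-million-dollar rack price (Nvidia does not publish list pricing, so that number is an assumption):

  # Electricity vs. hardware cost over a rack's service life; the rack price is an assumption.
  USD_PER_KWH = 0.05        # negotiated industrial-scale rate cited above
  RACK_KW = 120             # GB200 NVL72 peak draw
  YEARS = 10
  HOURS_PER_YEAR = 8_760

  electricity_usd = RACK_KW * HOURS_PER_YEAR * YEARS * USD_PER_KWH   # ~$526k over a decade
  assumed_rack_price_usd = 3_000_000                                 # illustrative only
  print(f"electricity over {YEARS} years: ${electricity_usd:,.0f}")
  print(f"assumed hardware price:        ${assumed_rack_price_usd:,.0f}")

Even at peak draw around the clock, a decade of electricity is on the order of half a million dollars, well below the hardware itself.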

bildung 2 days ago | parent | prev [-]

Well, compared to the human brain, LLMs do approximately zero work. An LLM "neuron" is at least 3 orders of magnitude less complex than a neuron in the human brain - and that factor only accounts for the neuronal intrinsics we currently know of.

ben_w 2 days ago | parent | next [-]

Agreed. I think this means the fair comparison is either:

  "transistors vs. *synapses*"
or

  "an entire integrated computer with all necessary cooling, including a modifier to account for the amortised training effort required to achieve human-quality output vs. the amortised energy requirements and output of a human over their lifetime".
It has to be human-quality output to be a fair comparison; a million lines of gibberish is worthless.

The human has to be educated until 21 or so to be economically viable, retires in their late 60s, works 25% of the hours in a working week (and not at all in non-working weeks, e.g. holiday, sickness, periods of unemployment; and while parental leave is work, it isn't the specific work that people want to pay you for), and the brain itself is only ~20% of a human's calorific consumption.
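
Putting those round numbers together (the working weeks per year and daily calorie intake are my own added assumptions), the amortisation looks roughly like this:

  # Amortise a human's working hours against lifetime brain energy; all numbers are rough.
  EDUCATION_END = 21
  RETIREMENT_AGE = 67
  WORK_FRACTION_OF_WEEK = 0.25     # ~42 of 168 hours
  WORKING_WEEKS_PER_YEAR = 46      # assumption: minus holiday, sickness, unemployment
  BRAIN_SHARE_OF_CALORIES = 0.20
  KCAL_PER_DAY = 2_000             # assumption: typical adult intake

  working_hours = (RETIREMENT_AGE - EDUCATION_END) * WORKING_WEEKS_PER_YEAR * 168 * WORK_FRACTION_OF_WEEK
  brain_kcal = RETIREMENT_AGE * 365 * KCAL_PER_DAY * BRAIN_SHARE_OF_CALORIES
  print(f"{working_hours:,.0f} working hours, {brain_kcal / working_hours:,.0f} kcal of brain energy each")

On those assumptions the amortised brain energy comes out to roughly 100 kcal per hour of economically useful output, i.e. about a banana an hour.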

In the (currently quite small number of) tasks where the AI we have is good enough to replace human labour, for some models it is already in the range where the marginal energy cost for inference is smaller than the energy cost (in food calories) to get a human to do the same thing.

But also, last I checked, the peak performance of LLMs is not as high as a domain expert at anything, so even infinite cost put into the AI isn't going to equal them. On the other hand, human intelligence is not equal for all of us, so I find it very easy to believe that there's a significant fraction of the population who will always, over their lifetime, be behind today's SOTA AI, and therefore infinite time and energy for them isn't ever going to equal the AI we already have.

pama 2 days ago | parent | prev [-]

Agreed. And that near-zero work has a near-zero energy cost. In addition, silicon inference (combining hardware and software advances) continues to be optimized and becomes more energy efficient at a rapid rate.

There is an unfounded myth about the extreme energy cost of silicon-based inference that is far from reality.

ben_w 2 days ago | parent | prev | next [-]

We've already got distilled-down versions of models designed to fit on consumer-sized devices; they are definitely not as performant as the bigger models.

But the models are RAM-limited, not compute-limited, and there's no reason consumer devices need to have their current RAM limits. Put 256 GB of RAM in your phone and an LLM may drain the battery in 15 minutes, and I have no idea about the bus bandwidth, but the NPU (e.g. the Neural Engine in Apple SoCs for the last few years) is already enough for the compute part of the problem.
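
A rough sketch of why decoding is memory-bound rather than compute-bound: each generated token has to stream the active weights through memory once, so tokens/s is roughly bandwidth divided by the bytes read per token. The model size, quantisation, and bandwidth below are illustrative assumptions, not measurements of any particular phone:

  # Decode speed when memory-bandwidth-bound: tokens/s ~= bandwidth / bytes read per token.
  def decode_tokens_per_sec(active_params_billion: float,
                            bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
      bytes_per_token = active_params_billion * 1e9 * bytes_per_param
      return bandwidth_gb_s * 1e9 / bytes_per_token

  # assumption: a 70B dense model quantised to ~4 bits (0.5 bytes/param) on a ~100 GB/s mobile SoC
  print(f"{decode_tokens_per_sec(70, 0.5, 100):.1f} tokens/s")   # ~2.9 tokens/s

Under those assumptions, more RAM gets a big model onto the device, but the memory bus, not the NPU, sets the generation speed.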

yummybear 2 days ago | parent | prev | next [-]

Even further: could it download a distilled model at runtime in response to your type of question? If we're talking vacation planning, download vacation.model for 10 seconds and then let's talk?

dragochat 2 days ago | parent | prev | next [-]

YES

We'll always find uses for more intelligence if it keeps getting more and more general (I don't like the term AGI because I think the "G" there is a quantity, not a quality, and humans are very low on generality too compared to what could be mathematically and physically possible for intelligence in our universe).

...we won't stop until the planet is papered with compute hardware UNLESS we accelerate space development too (that's why SPACE is CRUCIAL!) and go grind the asteroid belt into thousands of datacenters as well, then on and on.

There's a whole yummy lightcone waiting to be eaten :P

msgodel 2 days ago | parent | prev [-]

You could probably use some heuristic on the training tokens to weight customer-service-related data higher.
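
One way to read that: score each training example for domain relevance and use the score as a sampling weight when assembling the fine-tuning mix. A minimal sketch with a hypothetical keyword heuristic (the keywords, boost, and cap are made up for illustration):

  import random

  # Up-weight customer-service-like examples when sampling a fine-tuning batch (illustrative heuristic).
  CS_KEYWORDS = {"refund", "order", "shipping", "account", "warranty", "support"}

  def relevance_weight(text: str, boost: float = 5.0) -> float:
      hits = sum(1 for word in CS_KEYWORDS if word in text.lower())
      return 1.0 + boost * min(hits, 3)   # cap so a few keywords don't dominate the mix

  def sample_batch(corpus: list[str], batch_size: int) -> list[str]:
      weights = [relevance_weight(doc) for doc in corpus]
      return random.choices(corpus, weights=weights, k=batch_size)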