The brain is certainly vastly more energy efficient at inference than LLMs on GPUs. But it looks like you're trying to make a different argument, that an LLM can spend less energy than a human to complete a given task. Unfortunately, you have not made that argument and I won't be reading unverified LLM output that might contain hallucinated steps or claims.
> V3/R1 scale models as a baseline, one can produce 720,000 tokens
On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like for like comparison. To be complete, the cost required to match human level quality, if possible, also needs accounting for.
> Deeply thinking humans expend up to a third of their total energy on the brain
Where did you get this from? A 70B LLM? It's wrong or, at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy use is spent on things like upkeep and maintaining resting membrane potential. Ongoing "background activity" like the DMN also means the brain is always actively computing something interesting.
> > V3/R1 scale models as a baseline, one can produce 720,000 tokens
> On what hardware? At how many tokens per second? But most importantly, at what quality?
The hardware is the GB200 NVL72 by NVidia. This is for the class of 671B DeepSeek models, eg R1-0528 or V3, with their full accuracy setup (ie reproducing the quality of the reported DeepSeek benchmarks). Here is the writeup (by humans; the second figure shows the tokens per second per GPU as a function of the batch size, which emphasizes the advantages of centralized decoding, compared to current hacks at home): https://lmsys.org/blog/2025-06-16-gb200-part-1/
The LLM text I linked in my original answer carries out the math using the energy consumption of the NVidia hardware setup (120kW) and rather simple arithmetic, which you can reproduce.
I agree with you that quality is the most important question, for similar reasons.
I don't think that current models are at expert level, but they do seem to be reliably good enough to be useful and pass standardised tests and be generally quite solidly in the "good enough you have to pay close attention for a while before you notice the stupid mistake" area that makes them very irritating for anyone running job interviews or publishing books etc.
And worse, I also think the numbers you're replying to are, at best, off by a few decimal places.
If I take the 0.36 bananas (which was already suspicious) and USD 0.1 / kWh, I get 0.004 USD. If I scale that up by 1/0.72 to get a megatoken, that's still only 5/9ths of a cent.
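For anyone who wants to check the scaling step, here is a minimal sketch in Python. It takes the rounded 0.004 USD figure and the 720,000 tokens from above as given and does not re-derive the banana-to-kWh conversion; the variable names are mine.

```python
from fractions import Fraction

# Figures quoted in the thread; nothing here is independently measured.
cost_for_720k_tokens_usd = Fraction(4, 1000)  # rounded 0.004 USD for 720,000 tokens
tokens_in_megatokens = Fraction(72, 100)      # 720,000 tokens = 0.72 megatokens

# Scale the cost up by 1/0.72 to get the cost of one megatoken.
cost_per_megatoken_usd = cost_for_720k_tokens_usd / tokens_in_megatokens
cost_per_megatoken_cents = cost_per_megatoken_usd * 100

print(cost_per_megatoken_cents)                    # exact fraction of a cent
print(round(float(cost_per_megatoken_cents), 3))   # same value as a decimal
```

Using exact fractions avoids any rounding noise, so the result can be compared directly with the 5/9ths figure.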
If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap. It's close enough to the cost of their text-embedding-3-small (per megatoken) to be within the fudge-factor of my assumption about how much of their prices are electricity costs, but text-embedding models are much, much weaker than transformer models, to the point they're not worth considering in the same discussion unless you're making an academic point.
> It's wrong or at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%).
Indeed.
Now I'm wondering: how much power does the human brain use during an epileptic fit? That seems like it could plausibly be 70% of calories for the few seconds of the seizure? But I've only got GCSE grade C in biology, so even with what I picked up in the subsequent 25 years of general geeking, my idea of "plausible" is very weak.
> If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap
This assumption is very wrong. The primary cost factor in inference is the GPU itself. NVidia’s profit margins are very high; so are OpenAI’s margins for the API usage, even after taking into account the costs of the GPU. You can understand their margins if you read about inference at scale, and the lmsys blog in my parallel answer is a decent eye opener if you thought that companies sell tokens close to the price of electricity.
An alternative and perhaps easier way to estimate the relative importance of the GPU cost vs the electricity cost is to estimate how many years of constant use of the GPU at full power you need for the cost of industrial-scale electricity to catch up to the cost of industrial-scale GPU pricing. The H200 had 700W max power draw and about 40k USD cost (price varies a lot); the typical lowest rental price a year ago was 2USD/h, possibly a bit lower by now. In 1h you could not even spend 1kWh of electricity on one under optimal compute conditions, and at scale you can negotiate prices lower than 0.05 USD per kWh in some parts of the US. Alternatively, assume 0.05 USD per kWh, and use the GB200 NVL72 that draws 120kW at peak. That is a cost of 6USD/hour or $52.6k per year. Even if one were to use the hardware for 10 years straight without problems at peak performance, the cost of electricity is far lower than the cost of the hardware itself (you have to ask NVidia for a quote, but expect a multi-million dollar tag, and they have no shortage of customers ready to pay).
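The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the rough figures quoted in this comment (0.05 USD/kWh, 120kW for the GB200 NVL72, 700W and about 40k USD for the H200); the function and variable names are made up for illustration, and a year is taken as 8760 hours.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def electricity_cost_per_hour(power_kw, usd_per_kwh):
    """Cost in USD of running a constant load of power_kw for one hour."""
    return power_kw * usd_per_kwh

# Rough figures from the comment; real prices vary a lot.
USD_PER_KWH = 0.05

# GB200 NVL72 rack at its quoted 120 kW peak draw.
nvl72_per_hour = electricity_cost_per_hour(120, USD_PER_KWH)
nvl72_per_year = nvl72_per_hour * HOURS_PER_YEAR

# Single H200 at its quoted 700 W max draw and ~40k USD purchase price.
h200_per_hour = electricity_cost_per_hour(0.7, USD_PER_KWH)
h200_breakeven_years = 40_000 / (h200_per_hour * HOURS_PER_YEAR)

print(f"NVL72 electricity: {nvl72_per_hour:.2f} USD/h, {nvl72_per_year:,.0f} USD/year")
print(f"H200 electricity: {h200_per_hour:.3f} USD/h vs ~2 USD/h rental")
print(f"Years of full-power use before electricity matches the H200 price: {h200_breakeven_years:.0f}")
```

With these inputs the H200 break-even comes out at well over a century of continuous full-power use, which is the point: at industrial electricity prices the hardware dominates.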
That math is for random projections? Note that JL lemma is a worst case guarantee and in practice, there's a lot more distortion tolerance than the given bounds would suggest. Concepts tend to live in a space of much lower intrinsic dimensionality than the data's and we often care more about neighbor and rank information than precise pair-wise distances.
Also, JL is only a part of the story for the transformers.
Johnson-Lindenstrauss is an example of a probabilistic existence argument: the probability of a random projection having low error is nonzero, therefore a low-error projection must exist. That doesn't mean any given random projection can be expected to have low error, although if you keep rerolling often enough, you'll eventually find one.
The existence argument only provides a lower bound on the number of dimensions that can be represented with low error, but there's not necessarily much room for improvement left.
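If you want to poke at the random projection case numerically, here is a minimal pure-Python sketch. It draws one Gaussian projection matrix (scaled by 1/sqrt(k)) and reports how far pairwise distances move. The sizes (15 points, 200 dimensions projected down to 100) and the seed are arbitrary choices of mine, and a single draw on random data says nothing about worst-case bounds or about structured data.

```python
import math
import random

def random_projection(points, k, rng):
    """Project points from d to k dimensions with a Gaussian matrix scaled by 1/sqrt(k)."""
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]

def distance_ratios(original, projected):
    """Ratio of projected to original Euclidean distance for every pair of points."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            before = math.dist(original[i], original[j])
            after = math.dist(projected[i], projected[j])
            ratios.append(after / before)
    return ratios

rng = random.Random(0)   # fixed seed so the run is repeatable
d, k, n = 200, 100, 15   # arbitrary illustrative sizes
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ratios = distance_ratios(points, random_projection(points, k, rng))

print(f"pairs: {len(ratios)}, min ratio: {min(ratios):.3f}, max ratio: {max(ratios):.3f}")
```

A ratio of 1.0 means the distance was preserved exactly; rerunning with different seeds or a smaller k gives a feel for how often a single draw is "good enough".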
On top of learning and iterating, humans also build abstractions which significantly shorten the number of steps required. These are more expressive programming languages, compilers and toolchains. We also build engines, libraries, DSLs and invent appropriate data-structures to simplify the landscape or reuse existing work. Besides abstractions, we build tools like better type systems, error testing and borrow checkers to help eliminate certain classes of errors. Finally, after all is said and done, we still have QA teams and major bugs.
> he's inability to see its application to modern compute held the field back by years.
I find Schmidhuber's claim on GANs to be tenuous at best, but his claim to have anticipated modern LLMs is very strong, especially if we are going to be awarding Nobel Prizes for Boltzmann Machines. In https://people.idsia.ch/%7Ejuergen/FKI-147-91ocr.pdf, he really does concretely describe a model that unambiguously anticipated modern attention (technically, either an early form of hypernetworks or a more general form of linear attention, depending on which of its proposed update rules you use).
I also strongly disagree with the idea that his inability to practically apply his ideas held anything back. In the first place, it is uncommon for a discoverer or inventor to immediately grasp all the implications of and applications of their work. Secondly, the key limiter was parallel processing power; it's not a coincidence ANNs took off around the same time GPUs were transitioning away from fixed function pipelines (and Schmidhuber's lab were pioneers there too).
In the interim, when most derided neural networks, his lab was one of the few that kept research on neural networks and their application to sequence learning going. Without their contributions, I'm confident Transformers would have happened later.
> It's clear to me no one read his early paper's when developing GANs
This is likely true.
> self-supervision/transformers.
This is not true. Transformers came after lots of research on sequence learners, meta-learning, generalizing RNNs and adaptive alignment. For example, Alex Graves' work on sequence transduction with RNNs eventually led to the direct precursor of modern attention. Graves' work was itself influenced by work with and by Schmidhuber.
The non-o-series models from OpenAI and non-Opus (although I have not tried the latest, so it's possible that it too joins them) from Anthropic are cloyingly sycophantic, with every other sentence of yours containing a brilliant and fascinating insight.
It's possible that someone already on the verge of a break or otherwise in a fragile state of mind asking for help with their theories could end up with an LLM telling them how incredibly groundbreaking their insights are, perhaps pushing them quicker, deeper and more unmoored in the direction they were already headed.
This contains a common misstep (or misgeneralization of an analogy) among those who are much more familiar with computers than with the brain. The brain is not digital and concepts like frames per second and resolution don't make much sense for vision. First, there aren't frames: neuron activity is asynchronous, with sensory neuron firing rates changing in response to changes in the environment or according to saliency.
Between the non-uniformity of receptor density (eg fovea vs peripheral vision but this is general across all senses), dynamic receptor fields and the fact that information is encoded in terms of spike rate and timing patterns across neural populations, the idea of pixels in some bitmap at some resolution is beyond misleading. There is no pixel data, just sparsely coded feature representations capturing things like edges, textures, motion, color contrast and the like, already, at the retina.
While hundreds of trillions of photons might hit our photoreceptors, > 99% of that is filtered and/or compressed before even reaching retinal ganglion cells. Only a tiny fraction, about 10 million bits/sec, of the original photon signal rate is transferred through the optic nerve (per eye). This pattern of filtering and attentive prioritization of information in signals continues as we go from sensory fields to thalamus to higher cortical areas.
So while we might encounter factoids like: on the order of a billion bits per second of data hit photoreceptors or [10Mb/s transferred](https://www.britannica.com/science/information-theory/Physio...) along optic nerves, it's important to keep in mind that a lot of the intuition gained from digital information processing does not transfer in any meaningful sense to the brain.
If you consider the entire biological pipeline then the filtering is part of that. The quantity of raw data remains much greater than that available to any vision model. If anything the filtering done by biology should make it clear that there's vast room for model architecture improvement.
I think the point remains that few have been able to catch up to OpenAI. For a while it was just Anthropic. Then Google after failing a bunch of times. So, if we relax this to LLMs not by OpenAI, Anthropic or Google, then Deepseek is really the only one that's managed to reach their quality tier (even though many others have thrown their hat into the ring). We can also get approximate glimpses into which models people use by looking at OpenRouter, sorted by Top Weekly.
In the top 10 are models by OpenAI (gpt4omini), Google (gemini flashes and pros), Anthropic (Sonnets) and Deepseek. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.
Personally, the models meeting my quality bar are: gpt 4.1, o4-mini, o3, gemini2.5pro, gemini2.5flash (not 2.0), claude sonnet, deepseek and deepseek r1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute constrained scenarios, just that only Deepseek reaches the quality tier of the big 3.
There is skill to it but that's certainly not the only relevant variable involved. Other important factors are:
Language: Syntax errors rise with less common languages, and a common form is the syntax of a more common language bleeding through.
Domain: More than by what humans deem complex, quality is controlled by how much code and documentation there is for a domain. Interestingly, in a less common subdomain the model will often revert to a more common approach (for example, working on shaders for a game that takes place in a cylinder geometry requires a lot more hand-holding than on a plane). It's usually not that they can't do it, but that they require much more involved prompting to get the context appropriately set up, followed by managing drift back to default, more common patterns. Related are decisions with long term consequences. LLMs are pretty weak at these. In humans this skill comes with experience, so it's rare and an instance of low coverage.
Dates: Related is reverting to obsolete API patterns.
Complexity: While not as dominant as domain coverage, complexity does play a role, with the likelihood of error rising with complexity.
This means if you're at the intersection of multiple of these (such as a low coverage problem in a functional language), agent mode will likely be too much of a waste for you. But interactive mode can still be highly productive.
LeBron is one of the rare individuals at that intersection of high athleticism and mental capability. It's why at the age of 40, well past his athletic prime, he's still a top NBA player. He has Magnus-level chunking ability enabling prodigious memory for games, he has fast processing and court vision, being able to leverage symmetries to automatically adjust for current player orientations to predict opponent plays. It's what allows him to make passes that seem impossible--he sees windows open up based on predicted player movements, not just current positions. Like that famous Wayne Gretzky quote.
It's a super rare archetype of athleticism/size+mental that only the likes of LeBron, Jokic and Magic Johnson have occupied (not meant to be an exhaustive list).
The essence of the article is that self-correction exists as a nascent ability in base models already (more robustly in some like Qwen than others). This is highly reminiscent of Chain of Thought, which was found to be a capability already present in base models too. The result of RL is to reinforce already present authentic self-correction patterns and down weight superficial self-correction.
Thoughts:
- An analogy you shouldn't zoom in on too closely: going from CoT to reasoning traces is like going from purely ballistic trajectories to including navigation and thrusters. RL is for learning how to use the thrusters for adjustments based on the model's internal encodings of rare samples† where some author fully spelled out their thought process.
- This might also explain why SFT on reasoner traces seems to be surprisingly effective. If it were purely an RL mediated phenomenon, SFT for reasoning would not work nearly as well.
- Deepseek struggled to get RL to work on smaller models. If this is replicated, it might be the case that larger models encode self-correction patterns more robustly while having them as more probable.
- Imitating traces is easier than pure RL for bringing such patterns to the fore, for smaller models. However, we still want models to learn how to dynamically adjust their thrusters, and SFT does not provide ample opportunity for this. Further training with RL or, alternatively, replacing SFT with methods like [Critique Fine-Tuning](https://arxiv.org/abs/2501.17703) is needed.
- The article incidentally reinforces that having a low temperature means consistency not correctness. Except for high confidence scenarios, the highest greedily computed probability answer is generally less likely to be among the best ones the model can give.
†Question: First thought is blogs by people who discuss what didn't work. But, I wonder how much of reasoning model patterns and ability is shaped by Detective Conan transcripts?
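On the temperature point, a toy sketch in Python of why low temperature buys consistency rather than correctness. The three logits are hypothetical scores for three candidate answers and nothing in the code knows which one is actually right. It only shows that dividing logits by a small temperature concentrates probability on whichever answer already scored highest. It is a single-step illustration, not a model of full sequence decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate answers; correctness is not encoded anywhere.
logits = [2.0, 1.5, 0.5]

probs_t1 = softmax_with_temperature(logits, 1.0)   # ordinary sampling
probs_low = softmax_with_temperature(logits, 0.1)  # near-greedy sampling

print("T=1.0:", [round(p, 3) for p in probs_t1])
print("T=0.1:", [round(p, 3) for p in probs_low])
```

At T=0.1 nearly all the mass sits on the first candidate, so repeated samples agree with each other. If that candidate happens to be the wrong answer, they simply agree on the wrong answer.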