I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.

ben_w 2 hours ago

Because they don't yet know how to "just stop emitting so much hot air" without also removing the models' ability to do anything like "thinking" (or whatever you want to call the transcript mode). Knowing which tokens are hot air is the hard problem itself.

They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.

Terr_ 2 hours ago

IMO it supports the framing that it's all just a "make document longer" problem, and that our human brains are primed for a kind of illusion: we perceive/infer a mind because, traditionally, a mind has been the only thing that produces such fitting language.

ben_w 2 hours ago

To an extent. Even though they're clearly improving, they also definitely look better than they actually are. This time last year they couldn't write compilable source code for a compiler for a toy language; I know because I tried.

observationist an hour ago

This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.

I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.

Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low-hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or two of low-hanging fruit and hacking on everything that makes up current frontier models.

It'll be interesting to see if there are any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).

https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988

Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.

sambaumann 2 hours ago

Because for API users, they get to charge for 3x the tokens for the same requests.

CamperBob2 2 hours ago

The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.

Terr_ 2 hours ago

Related are "reasoning" models, where there's a stream of "hot air" that's not shown to the end user. I think of it as a film noir script: the hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.