I'm a bit surprised the article makes no mention of Google's TurboQuant[0] introduced 26 days prior.

Given that TurboQuant results in a 6x reduction in memory usage for KV caches and up to 8x boost in speed, this optimization is already showing up in llama.cpp, enabling significantly bigger contexts without having to run a smaller model to fit it all in memory.

Some people thought it might significantly improve the RAM situation, though I remain a bit skeptical - the demand is probably still larger than the reduction TurboQuant brings.

[0] https://news.ycombinator.com/item?id=47513475

▲

gajjanag 11 hours ago | parent | next [-]

TurboQuant is known across the industry to not be state of the art. There are superior schemes for KV quant at every bitrate. E.g., SpectralQuant: https://github.com/Dynamis-Labs/spectralquant among many, many papers.

> Given that TurboQuant results in a 6x reduction in memory usage for KV caches

It all depends on the baseline. The "6x" is relative to a BF16 KV cache, not a state-of-the-art 8- or 4-bit KV cache scheme.
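
To make the baseline point concrete, here is a rough back-of-the-envelope sketch. It assumes the "6x" is measured against a 16-bit cache and ignores per-block scale/metadata overhead; the function name is purely illustrative.

```python
def effective_ratio(claimed_ratio: float, claimed_baseline_bits: float,
                    new_baseline_bits: float) -> float:
    """Re-express a compression ratio against a different baseline bit width."""
    # Bits per cached value implied by the claim (overhead ignored, an assumption).
    compressed_bits = claimed_baseline_bits / claimed_ratio
    return new_baseline_bits / compressed_bits

# The same compressed cache compared against 16-, 8- and 4-bit baselines.
for bits in (16, 8, 4):
    print(f"vs {bits}-bit baseline: {effective_ratio(6.0, 16, bits):.2f}x")
```

Under those assumptions the same scheme is 6x against BF16, but only about 3x against an 8-bit cache and about 1.5x against a 4-bit one.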

▲

lhl 15 hours ago | parent | prev | next [-]

BTW, a number of corrections. The TurboQuant paper was submitted to arXiv back in April 2025: https://arxiv.org/abs/2504.19874

Current "TurboQuant" implementations get about 3.8X-4.9X compression (w/ the higher end taking some significant hits to GSM8K performance) at about 80-100% of baseline speed (i.e. no improvement, and in some cases a regression): https://github.com/vllm-project/vllm/pull/38479

For those not paying attention, it's probably worth sending this and ongoing discussion for vLLM https://github.com/vllm-project/vllm/issues/38171 and llama.cpp through your summarizer of choice - TurboQuant is fine, but not a magic bullet. Personally, I've been experimenting with DMS and I think it has a lot more promise and can be stacked with various quantization schemes.

The biggest savings in kvcache, though, come from improved model architecture. Gemma 4's SWA/global hybrid saves up to 10X kvcache, MLA/DSA (the latter of which helps with global attention compute) do as well, and using linear or SSM layers saves even more.
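
A rough sketch of why a sliding-window/global hybrid shrinks the cache: local layers only keep the last `window` tokens, while global layers keep the whole context. Every number below (layer counts, heads, head size, window) is a made-up illustration, not Gemma 4's actual configuration.

```python
def kv_cache_bytes(context_len, n_global_layers, n_local_layers, window,
                   n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for one sequence, in bytes."""
    # Keys and values are both cached, hence the factor of 2.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    # Global layers cache every token; local (SWA) layers cap at the window size.
    cached_tokens = (n_global_layers * context_len
                     + n_local_layers * min(context_len, window))
    return per_token_per_layer * cached_tokens

ctx = 128 * 1024  # hypothetical 128K-token context
full = kv_cache_bytes(ctx, n_global_layers=48, n_local_layers=0, window=1024)
hybrid = kv_cache_bytes(ctx, n_global_layers=8, n_local_layers=40, window=1024)
print(f"full: {full / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB, "
      f"saving: {full / hybrid:.1f}x")
```

With these assumed numbers the saving approaches (total layers / global layers) as the context grows, since the local layers stop growing once the context exceeds the window.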

None of these reduce memory demand (Jevons paradox, etc), though. Looking at my coding tools, I'm using about 10-15B cached tokens/mo currently (was 5-8B a couple months ago) and while I think I'm probably above average on the curve, I don't consider myself doing anything especially crazy and this year, between mainstream developers, and more and more agents, I don't think there's really any limit to the number of tokens that people will want to consume.

▲

fy20 14 hours ago | parent | prev | next [-]

The work going into local models seems to be targeting lower RAM/VRAM which will definitely help.

For example Gemma 4 32B, which you can run on an off-the-shelf laptop, is around the same or even higher intelligence level as the SOTA models from 2 years ago (e.g. gpt-4o). Probably by the time memory prices come down we will have something as smart as Opus 4.7 that can be run locally.

Bigger models of course have more embedded knowledge, but just knowing that they should make a tool call to do a web search can bypass a lot of that.

▲

tuetuopay 15 hours ago | parent | prev | next [-]

The net effect won’t be a memory use reduction to achieve the same thing. We’ll do more with the same amount of memory. Companies will increase the context windows of their offerings and people will use it.

That is the sad reality of the future of memory.

▲

ehnto 14 hours ago | parent [-]

I am not convinced that more context will be useful. Practical use of current models at a 1M context window shows they get less effective as the window grows. Given model progress is slowing as well, perhaps we end up reaching a balance of context size and competency sooner than expected.

▲

tuetuopay 13 hours ago | parent [-]

Stuff in more code. Stuff in more system prompt. Stuff in raw utf8 characters instead of tokens to fix strawberries. Stuff in WAY more reasoning steps. Given the current tech, I also doubt there will be practical uses and I hope we’ll see the opposite of what I wrote. But given the current industry, I fully trust them to somehow fill their hardware. Market history shows us that when the cost of something goes down, we do more with the same amount, not the same thing with less. But I deeply hope to be wrong here and the memory market will relax.

▲

Bombthecat 16 hours ago | parent | prev | next [-]

You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that big.

▲

anon373839 15 hours ago | parent [-]

That's not what consumes the most memory at scale. The KV caches are per-user.
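
A quick sketch of that scaling argument: the weights are loaded once and shared, while each concurrent user (sequence) needs its own KV cache. The 60 GiB of weights and 2 GiB of cache per user are invented numbers for illustration only.

```python
def serving_memory_gib(weights_gib, kv_gib_per_user, concurrent_users):
    """Return (total GiB, fraction of the total used by KV caches)."""
    kv_total = kv_gib_per_user * concurrent_users  # one cache per active sequence
    total = weights_gib + kv_total                 # weights are loaded once and shared
    return total, kv_total / total

# Hypothetical server: 60 GiB of weights, 2 GiB of KV cache per user.
for users in (1, 10, 100):
    total, kv_share = serving_memory_gib(60.0, 2.0, users)
    print(f"{users:>3} users: {total:6.1f} GiB total, {kv_share:.0%} in KV caches")
```

On a single-user 16 GB machine the weights dominate, so KV compression buys little; with many concurrent users the caches dominate instead.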

▲

WesolyKubeczek 16 hours ago | parent | prev | next [-]

You can still use as much memory, but fit more things into it, so I don’t think the current market hogs will let go easily.

▲

muyuu 13 hours ago | parent | prev | next [-]

that will only increase the demand for RAM as models will now be usable in scenarios that weren't feasible prior, and the ceiling for model and context size is not even visible at this point

I hate to mention Jevons paradox as it has become cliche by now, but this is a textbook example of it
