| ▲ | TurboQuant: Redefining AI efficiency with extreme compression(research.google) |
| 292 points by ray__ 9 hours ago | 92 comments |
| |
|
| ▲ | amitport 6 hours ago | parent | next [-] |
| This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions. |
| |
| ▲ | sva_ an hour ago | parent | next [-] | | Schmidhuber'd | |
| ▲ | jmalicki an hour ago | parent | prev | next [-] | | If they didn't cite your paper that's bullshit. But if they read your paper enough that they invited you to a talk, that probably means they were far enough along toward independently inventing it that they were going to do so anyway, and wanted to chat with someone who was also doing the thing they were already doing. Good ideas tend to reveal themselves to anyone who is aware of the problem. | | | |
| ▲ | eecc 2 hours ago | parent | prev | next [-] | | Pardon my simplistic question, but when you mean rotation you’re essentially talking about diagonalization aren’t you? So storing the diagonal as a matrix and the new bases is more compact? | | |
| ▲ | amitport an hour ago | parent [-] | | In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure. |
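A minimal numpy sketch of that point (my own illustration, not code from either paper): a random orthogonal rotation preserves the vector's norm but spreads a single outlier coordinate's energy across all dimensions, which is what makes simple coordinate-wise quantization behave predictably.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A "spiky" vector: nearly all energy in one coordinate (a typical outlier).
x = np.zeros(d)
x[0] = 10.0

# Random orthogonal matrix (a random rotation, up to reflection) obtained
# from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = Q @ x

# The rotation preserves the norm exactly but flattens the coordinate
# profile, so no single coordinate dominates the quantization range.
print(np.linalg.norm(x), np.linalg.norm(y))  # norms are equal
print(np.abs(x).max(), np.abs(y).max())      # max coordinate shrinks a lot
```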
| |
| ▲ | busfahrer 3 hours ago | parent | prev [-] | | I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA? | | |
| ▲ | yorwba 2 hours ago | parent [-] | | Multi-Head Latent attention is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two. |
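To make the retrofit point concrete, here is a toy vector-quantization sketch (my own illustration; TurboQuant's actual quantizer is different and these dimensions are unrealistically small): a codebook is fitted to cached keys after the fact, each key is replaced by a short codebook index, and attention scores against the quantized keys stay close to the originals.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_codes = 4, 1000, 16  # toy sizes; real caches are far larger

keys = rng.standard_normal((n, d))  # stand-in for cached key vectors

# Tiny k-means to build a codebook post hoc -- no model retraining needed.
codebook = keys[rng.choice(n, n_codes, replace=False)].copy()
for _ in range(20):
    assign = ((keys[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
    for c in range(n_codes):
        if (assign == c).any():
            codebook[c] = keys[assign == c].mean(0)
assign = ((keys[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)

# Each key is now stored as a 4-bit index instead of d floats.
keys_hat = codebook[assign]
q = rng.standard_normal(d)  # a query vector
score_err = np.abs(keys @ q - keys_hat @ q).mean()
print(score_err)  # attention-score perturbation stays bounded
```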
|
|
|
| ▲ | gavinray 35 minutes ago | parent | prev | next [-] |
| Can someone ELI5 these two concepts please, which make no sense to me: > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"
I don't understand how taking a series of data and applying a random rotation could mathematically lead every time to "simpler" geometry. If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right? > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."
How can a boolean value preserve all of the relational and positional information between data points? |
| |
| ▲ | lumost 15 minutes ago | parent | next [-] | | They are saying that models should be invariant to data's orientation - and only sensitive to the distance between vectors. This has a pretty significant effect on reducing the set of possible models, and may stabilize the optimization. In simple terms, large ML models like LLMs often learn trivial rules such as "if the 21st decimal place of the 5th dimension in the embedding vector is 5 - then the image is of a cat." Learning such a memorization function is usually not what we are trying to do, and there are a variety of techniques to avoid these trivial solutions and "smooth" the optimization geometry. | |
| ▲ | wordpad 18 minutes ago | parent | prev [-] | | They are not doing a random rotation; "simplification" here means they are aligning the outliers. If you threw a bunch of shapes on the ground, they are picking up the one that rolled away and putting it with the others. >How can a boolean value preserve all of the relational and positional information between data points? They aren't reducing the entire vector to a boolean, only each of its dimensions. |
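A small sketch of the per-dimension sign-bit idea (this is the classic SimHash angle estimator, offered as an analogy rather than QJL's exact construction): after a JL-style random projection, the fraction of positions where two vectors' sign bits disagree estimates the angle between them, so one bit per projected coordinate still preserves relational information in aggregate.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 4096  # original dimension, number of random projections

# Two vectors at a known angle.
u = rng.standard_normal(d)
v = u + 0.5 * rng.standard_normal(d)
true_angle = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random projection, then keep only the sign bit of each coordinate.
S = rng.standard_normal((m, d))
bu = np.sign(S @ u)
bv = np.sign(S @ v)

# The fraction of disagreeing sign bits estimates angle / pi.
est_angle = np.pi * np.mean(bu != bv)
print(true_angle, est_angle)  # close, at 1 bit per projection
```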
|
|
| ▲ | akhenakh an hour ago | parent | prev | next [-] |
| Someone implementing it on llamacpp already https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9... |
| |
| ▲ | cpburns2009 an hour ago | parent [-] | | For some reason I thought the implementation would be way more complicated than that. I obviously lack the domain knowledge to tackle something like this, but it looks straightforward. |
|
|
| ▲ | mmastrac 23 minutes ago | parent | prev | next [-] |
| Is this a tradeoff between GPU-computation-expense vs accuracy? ie: you could quantize into segments or grids on the unit circle/sphere/etc, but that's too expensive so it's better to just quantize to a Cartesian grid because the GPU can decompress cheaper? |
|
| ▲ | pstoll 2 hours ago | parent | prev | next [-] |
| And a group has published an independent working implementation today, nice to see: https://github.com/tonbistudio/turboquant-pytorch |
|
| ▲ | benob 7 hours ago | parent | prev | next [-] |
| This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated. |
| |
| ▲ | BenoitP 6 hours ago | parent | next [-] | | It is AI generated. Or was written by someone a bit far from the technical advances, IMHO. The Johnson-Lindenstrauss Lemma is a very specific and powerful concept, yet the article's QJL explanation is vacuous. A knowledgeable human would not have left the reader wanting for how it relates to the Lemma. | |
| ▲ | spencerflem 7 hours ago | parent | prev [-] | | I think it is though- “ TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.” | | |
| ▲ | zarzavat 4 hours ago | parent | next [-] | | I read "this clever step" and immediately came to the comments to see if anyone picked up on it. It reads like a pop science article while at the same time being way too technical to be a pop science article. Turing test ain't dead yet. | | |
| ▲ | TeMPOraL 6 minutes ago | parent [-] | | > Turing test ain't dead yet. Only because people are lazy, and don't bother with a simple post-processing step: attach a bunch of documents or text snippets written by a human (whether yourself or, say, some respected but stylistically boring author), and ask the LLM to match style/tone. |
| |
| ▲ | NoahZuniga 4 hours ago | parent | prev | next [-] | | Genius new idea: replace the em-dashes with semicolons so it looks less like AI. | | |
| ▲ | tux3 3 hours ago | parent [-] | | You're absolutely right. That's not just a genius idea; it's a radical new paradigm. |
| |
| ▲ | integralid 6 hours ago | parent | prev | next [-] | | I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed. | | |
| ▲ | nvme0n1p1 6 hours ago | parent | next [-] | | There are tells all over the page: > Redefining AI efficiency with extreme compression "Redefine" is a favorite word of AI. Honestly no need to read further. > the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels No competent engineer would describe a cache as a "cheat sheet". Cheat sheets are static, but caches dynamically update during execution. Students don't rewrite their cheat sheets during the test, do they? LLMs love their inaccurate metaphors. > QJL: The zero-overhead, 1-bit trick > It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead. Why does it keep emphasizing zero overhead? Why is storing a single bit a "trick?" Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely. It's 1:30am and I can't sleep, and I still regret wasting my time on this slop. | | |
| ▲ | veunes 5 hours ago | parent | next [-] | | Looks like Google canned all their tech writers just to pivot the budget into H100s for training these very same writers | | | |
| ▲ | roywiggins 25 minutes ago | parent | prev | next [-] | | "The X Trick" or "The Y Dilemma" or similar snowclones in a header is also a big AI thing. Humans use this construction too, but LLMs love it out of all proportion. I call it The Ludlum Delusion (since that's how every Robert Ludlum book is titled). | |
| ▲ | pqs 5 hours ago | parent | prev [-] | | There is also the possibility that the article went through the hands of the company's communications department, which has writers who probably write at LLM level. | |
| |
| ▲ | g-mork 4 hours ago | parent | prev [-] | | Another instinctual reaction here. This specific formulation pops out of AI all the time, there might as well have been an emdash in the title |
| |
| ▲ | benob 7 hours ago | parent | prev [-] | | Maybe they quantized a bit too much the model parameters... |
|
|
|
| ▲ | iddan an hour ago | parent | prev | next [-] |
| I am guessing that since Google is vertically integrated and "actually pays" for AI infra (compared to OpenAI & Anthropic, which receive hardware through partnerships), they have a more urgent incentive to reduce model sizes. Also, Google and Apple will be the first to gain from running models on-device |
| |
| ▲ | mrcwinn an hour ago | parent [-] | | I can assure you OpenAI and Anthropic pay for hardware. They don’t receive it for free. |
|
|
| ▲ | bilsbie an hour ago | parent | prev | next [-] |
| It seems like most breakthroughs I see are for efficiency? What are the most important breakthroughs from the past two or three years for intelligence? |
| |
| ▲ | Lerc 37 minutes ago | parent | next [-] | | If you think of it from the point of view of the universal approximation theorem, it's all efficiency optimisation. We know that it works if we do it incredibly inefficiently. Every architecture improvement is essentially a way to achieve the capability of a single fully-connected hidden-layer network n wide, with fewer parameters. Given these architectures usually still contain fully connected layers, unless they've done something really wrong, they should still be able to do anything if you make the entire thing large enough. That means a large enough [insert model architecture] will be able to approximate any function to arbitrary precision. As long as the efficiency gains of the architecture are retained as the scale increases, they should be able to get there quicker. |
| ▲ | ertgbnm an hour ago | parent | prev | next [-] | | Most breakthroughs that are published are for efficiency because most breakthroughs that are published are for open source. All the foundation model breakthroughs are hoarded by the labs doing the pretraining. That being said, RL reasoning training is the obvious and largest breakthrough for intelligence in recent years. | |
| ▲ | irthomasthomas an hour ago | parent | prev [-] | | Efficiency gains can be used to make existing models more profitable, or to make new larger and more intelligent models. |
|
|
| ▲ | vaildegraff 2 hours ago | parent | prev | next [-] |
| The accuracy preservation is impressive, but I'd want to see adversarial evaluation after quantization - not just benchmark scores. Compressed models can behave identically on clean inputs while diverging on edge cases. If your safety-critical behavior lives in the long tail of the distribution, a quantizer that rounds to the nearest centroid might round away your guardrails. Nobody publishes those numbers because nobody wants to find out. |
| |
| ▲ | hellcow 2 hours ago | parent [-] | | LLM slop. See their other comment which is even more obvious. | | |
| ▲ | 3 minutes ago | parent | next [-] | | [deleted] | |
| ▲ | vlovich123 an hour ago | parent | prev [-] | | They only have one comment on this site unless it was deleted… | | |
| ▲ | vidarh an hour ago | parent [-] | | They have several, but the others won't show unless you have showdead turned on, as they've already been flagged. |
|
|
|
|
| ▲ | ssijak 3 hours ago | parent | prev | next [-] |
| For my grug brain can somebody translate this to ELIgrug terms? Does this mean I would be able to run a 500b model on my 48gb macbook without losing quality? |
| |
| ▲ | x_may 2 hours ago | parent [-] | | KV cache compression, so how much memory the model needs to use for extending its context. Does not affect the weight size. |
|
|
| ▲ | bluequbit 7 hours ago | parent | prev | next [-] |
| I did not understand what polarQuant is. Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers? |
| |
| ▲ | Maxious 7 hours ago | parent | next [-] | | https://mesuvash.github.io/blog/2026/turboquant-interactive/ has a little visualisation | | |
| ▲ | pstoll 2 hours ago | parent | next [-] | | Good post but the link at the end is broken: "For the full technical explanation with equations, proofs, and PyTorch pseudocode, see the companion post: TurboQuant: Near-Optimal Vector Quantization Without Looking at Your Data." | |
| ▲ | spencerflem 7 hours ago | parent | prev [-] | | I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle aren’t all the center grid cords unused? | | |
| ▲ | fc417fc802 an hour ago | parent | next [-] | | Yeah that's odd. It seems like you'd want an n-1 dimensional grid on the surface of the unit sphere rather than an n dimensional grid within which the sphere resides. Looking at the paper (https://arxiv.org/abs/2504.19874) they cite earlier work that does exactly that. They object that grid projection and binary search perform exceptionally poorly on the GPU. I don't think they're using a regular grid as depicted on the linked page. Equation 4 from the paper is how they compute centroids for the MSE optimal quantizer. Why specify MSE optimal you ask? Yeah so it turns out there's actually two quantization steps, a detail also omitted from the linked page. They apply QJL quantization to the residual of the grid quantized data. My description is almost certainly missing key details; I'm not great at math and this is sufficiently dense to be a slog. | |
| ▲ | vincnetas 6 hours ago | parent | prev [-] | | i think grid can be a surface of the unit sphere |
|
| |
| ▲ | mrugge 7 hours ago | parent | prev | next [-] | | 1. Efficient recursive transform of KV embeddings into polar coordinates. 2. Quantize the resulting angles without the need for explicit normalization. This saves memory via a key insight: the angles follow a known distribution with an analytical form. | | |
| ▲ | viktorcode 5 hours ago | parent | prev [-] | | The way I understand it, it's a way of compressing vectors by switching from their per-component representation to a polar-coordinate representation, where nearby vectors clump around a single direction, allowing them to be described mostly by their different lengths |
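A toy sketch of the magnitude/direction split being described (an illustration only, not PolarQuant's recursive angle transform): keep the norm exactly (one float per vector), quantize only the normalized direction, and the reconstruction error stays a small fraction of the norm.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 128
x = rng.standard_normal(d) * 5.0

# Split into magnitude and direction.
r = np.linalg.norm(x)
u = x / r  # unit vector on the sphere

# Uniformly quantize only the direction's components, 4 bits each.
bits = 4
lo, hi = u.min(), u.max()
levels = 2 ** bits
q = np.round((u - lo) / (hi - lo) * (levels - 1))
u_hat = q / (levels - 1) * (hi - lo) + lo

# Reconstruct by rescaling the quantized direction by the exact norm.
x_hat = r * u_hat
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)  # modest relative error at 4 bits per component
```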
|
|
| ▲ | macleginn 3 hours ago | parent | prev | next [-] |
| "TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy" -- what do the 3 bits correspond to? Hardly individual keys or values, since that would limit each of them to 8 different vectors. |
| |
|
| ▲ | zeeshana07x 4 hours ago | parent | prev | next [-] |
| The gap between how this is described in the paper vs the blog post is pretty wide. Would be nice to see more accessible writing from research teams — not everyone reading is a ML engineer |
| |
| ▲ | om8 4 hours ago | parent | next [-] | | These are very different media types with very different goals. | |
| ▲ | dev_tools_lab 4 hours ago | parent | prev [-] | | Agreed. The practical implications are often more interesting than the math anyway — smaller models running locally means you can afford to run multiple models in parallel for cross-validation, which changes how you approach tasks like code analysis or bug detection. |
|
|
| ▲ | lwhi an hour ago | parent | prev | next [-] |
| Will this help us run models locally? |
|
| ▲ | naasking 20 minutes ago | parent | prev | next [-] |
| This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware. [1] https://github.com/z-lab/paroquant |
|
| ▲ | maurelius2 6 hours ago | parent | prev | next [-] |
| I'm somewhat at a loss here other than understanding the fundamentals. Can someone tell me how the compression impact performance? |
| |
| ▲ | dryarzeg 6 hours ago | parent | next [-] | | In short: for many inference tasks the bottleneck is memory bandwidth. Suppose you have a machine with a memory bandwidth of 256 GB/s, and you want to run inference for a 4B model (a model with 4 billion parameters). If you load the model in BF16 format (16 bits per weight), each forward pass (i.e. each token generated) will require streaming roughly ~8 GB through memory. So 256/8 = 32 t/s, and that's the generation speed you will be strictly capped at even if your processing power is measured in exaFLOPS. Now suppose you instead quantize the model and run the quantized version, say a Q4_K_M build (4 bits, with some weights taking more). Each forward pass now takes roughly 2-3 GB of memory bandwidth (rough approximations; in practice around 2 GB), so even in the worst case 256/3 = 85.3 t/s, while 256/2 = 128 t/s. Quantization can reduce the quality of the model and lower its performance, but with most modern quantization methods those losses are usually negligible (although, of course, still present). So, as you can see, quantization "widens" the memory bottleneck (it doesn't remove it fully) while still preserving (not always, though) acceptable quality. (Sorry for my terrible English, it's not my native language) | |
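The arithmetic in the comment above, wrapped in a few lines (same rough model: every generated token streams all weights through memory once; KV cache and other overheads are ignored):

```python
def tokens_per_second(params_billion: float, bits_per_weight: float,
                      bandwidth_gb_s: float) -> float:
    # Bytes that must be read from memory for one forward pass.
    bytes_per_pass = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_pass

print(tokens_per_second(4, 16, 256))  # BF16 4B model: 32.0 t/s cap
print(tokens_per_second(4, 4, 256))   # 4-bit quant: 128.0 t/s cap
```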
| ▲ | valine 6 hours ago | parent | prev [-] | | So let’s start with a really simple decoder transformer with a single layer and a single attention head, trained to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scalar attention score out). That scalar attention score first goes through softmax, and then becomes the weight you use to compute a weighted average of your values; the new value goes through the MLP, and the MLP output is projected into the logits from which you sample your next token (that’s the general idea at least; I skipped a few steps). The last query in the sequence will be new for every new token you predict, but the set of prior keys and values stays the same, i.e. keys and values are reusable. The key-value cache gets bigger and bigger for each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in VRAM, and you’d like to keep the size down by not storing the raw uncompressed tensors. To make this work well your compression needs two things: it needs to be fast, so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other: either the speed to decompress is too slow and your tokens/s takes a hit, or you lose important precision and the model output quality suffers. The claim in the paper is that they’ve made progress on both. | | |
| ▲ | edg5000 6 hours ago | parent [-] | | So limiting max context length also reduces VRAM needs a bit? If cache is 20% of total, 1/10th of context as a limit would mean 18% total memory reduction. | | |
| ▲ | valine 5 hours ago | parent [-] | | Yup exactly, in principle it helps with both inference speed by reducing memory bandwidth usage and also reduces the memory footprint of your kvcache. |
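A self-contained sketch of the decode loop described in this subthread (single head, no batching; variable names are mine): the stacked K/V arrays are exactly what grows with sequence length, and what KV-cache quantization shrinks.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(h):
    # Only the current token's query is new; keys/values for prior tokens
    # come from the cache instead of being recomputed.
    q = Wq @ h
    k_cache.append(Wk @ h)
    v_cache.append(Wv @ h)
    K = np.stack(k_cache)  # grows one row per token -> quantization target
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()           # softmax attention weights
    return w @ V           # weighted average of cached values

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(out.shape, len(k_cache))
```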
|
|
|
|
| ▲ | _s_a_m_ an hour ago | parent | prev | next [-] |
| has the word "advanced", gotta be good |
|
| ▲ | moktonar 7 hours ago | parent | prev | next [-] |
| Aren’t polar coordinates still n-1 angles + 1 for radius for an n-dim vector? If so, I understand that the angles can be quantized better, but when the radius r is big the error is large for highly quantized angles, right? What am I missing? |
| |
| ▲ | amitport 7 hours ago | parent [-] | | r is a single value per vector. You don't have to quantize it, you can keep it and quantize the billion+ other coordinates of the vector. | | |
| ▲ | mungoman2 6 hours ago | parent [-] | | What they're saying is that the error for a vector increases with r, which is true. Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means larger absolute error in the reconstructed vector. | | |
| ▲ | amitport 6 hours ago | parent [-] | | Yes, the important part is that the normalized error does not increase with the dimension of the vector (which does happen when using biased quantizers). It is expected that bigger vectors have proportionally bigger error; nothing can be done by the quantizer about that. |
|
|
|
|
| ▲ | lucrbvi 5 hours ago | parent | prev | next [-] |
| Sounds like Multi-Head Latent Attention (MLA) from DeepSeek |
| |
| ▲ | veunes 4 hours ago | parent [-] | | Nah, those are completely different beasts. DeepSeek's MLA solves the KV cache issue via low-rank projection - they literally squeeze the matrix through a latent vector at train time. TurboQuant is just Post-Training Quantization where they mathematically compress existing weights and activations using polar coordinates | | |
|
|
| ▲ | wei03288 27 minutes ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | leontloveless 2 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | pugchat 3 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | paxrel_ai an hour ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | veunes 5 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | aledevv 5 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | rsmtjohn 6 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | mohsen1 6 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | hikaru_ai 7 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | dev_tools_lab 4 hours ago | parent | prev | next [-] |
| [dead] |
|
| ▲ | mskkm 5 hours ago | parent | prev [-] |
| Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that’s why the paper reports accuracy-vs-space, but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. “Several orders of magnitude” speedups for vector search? Really? Has anyone actually reproduced these results? |
| |
| ▲ | fc417fc802 an hour ago | parent | next [-] | | Efficient execution on the GPU appears to have been one of the specific aims of the authors. Table 2 of their paper shows real world performance that would appear at a glance to be compatible with inference. | | |
| ▲ | mskkm 28 minutes ago | parent [-] | | This is not an LLM inference result. Table 2 is the part I find most questionable. Claiming orders-of-magnitude improvements in vector search over standard methods is an extraordinary claim. If it actually held up in practice, I would have expected to see independent reproductions or real-world adoption by now. It’s been about a year since the paper came out, and I haven’t seen much of either. That doesn’t prove the claim is false, but it certainly doesn’t inspire confidence. |
| |
| ▲ | NitpickLawyer 4 hours ago | parent | prev | next [-] | | Apparently MLX confirmed it - https://x.com/prince_canuma/status/2036611007523512397 | | |
| ▲ | mskkm 3 hours ago | parent [-] | | They confirmed the accuracy on NIAH but didn't reproduce the claimed 8x efficiency. |
| |
| ▲ | veunes 4 hours ago | parent | prev [-] | | Classic academic move. If the authors show accuracy-vs-space charts but hide end-to-end latency, it usually means their code is slower in practice than vanilla fp16 without any compression. Polar coordinates are absolute poison for parallel GPU compute | | |
|