amitport 8 hours ago

This is a great development for KV cache compression. I did notice a missing citation in the related work regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias-correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
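
For intuition, here is a minimal sketch of the general pattern: random rotation, then sign quantization, then a per-vector scale. It is illustrative only; DRIVE's actual scale and bias correction differ:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 256

    # Shared random rotation (sender and receiver derive it from a common seed).
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))

    x = rng.standard_normal(d) * np.linspace(0.1, 5.0, d)  # anisotropic input

    y = R @ x             # after rotation, coordinates look roughly i.i.d. Gaussian
    q = np.sign(y)        # extreme quantization: 1 bit per coordinate
    s = np.abs(y).mean()  # MSE-optimal scalar scale for a sign vector

    x_hat = R.T @ (s * q)  # receiver rescales and de-rotates

    print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

In distributed mean estimation, each client sends its seed, scale, and sign bits, and the server averages the reconstructions; getting the scale's bias right is what makes that average well behaved.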

eecc 4 hours ago | parent | next [-]

Pardon my simplistic question, but when you say rotation, you're essentially talking about diagonalization, aren't you?

So storing the diagonal matrix and the new basis is more compact?

amitport 3 hours ago | parent [-]

In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
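
A toy illustration of the energy-spreading point (mine, not from either paper):

    import numpy as np

    rng = np.random.default_rng(1)
    d = 1024

    # A "spiky" vector: nearly all energy in a few coordinates,
    # the worst case for a single shared coordinate-wise grid.
    x = np.zeros(d)
    x[:8] = 10.0 * rng.standard_normal(8)

    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    y = R @ x  # same norm, but the energy is now spread over all coordinates

    rms = np.linalg.norm(x) / np.sqrt(d)
    print("max/rms before rotation:", np.abs(x).max() / rms)
    print("max/rms after rotation: ", np.abs(y).max() / rms)

After the rotation every coordinate is approximately Gaussian with the same variance, so a single quantization grid fits all of them; that is what makes coordinate-wise quantization cheap.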

eecc an hour ago | parent [-]

Ah OK, so intuitively it's like minimizing the error when replacing the values with samples from a well-known distribution. So all you need to carry along is the rotation, plus accepting that there is some amount of loss.

jmalicki 2 hours ago | parent | prev | next [-]

If they didn't cite your paper, that's bullshit.

But if they read your paper closely enough that they invited you to a talk, that probably means they were far enough along toward independently inventing it that they were going to do so anyway, and wanted to chat with someone who was also doing the thing they were already doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.

amitport an hour ago | parent | next [-]

To be clear, I am not claiming they stole an idea. They have done significant independent research. However, a specific part, the treatment of rotation with bias correction, relates to prior work, and it would be appropriate to have that recognized.

efavdb an hour ago | parent | prev | next [-]

The earlier paper was from 2021!

ekjhgkejhgk 2 hours ago | parent | prev | next [-]

Doesn't matter, you should still cite. It's basic manners in science.

kleiba 2 hours ago | parent [-]

Exactly, that's why the section is called "Related Work".

cubefox 2 hours ago | parent | prev [-]

> But if they read your paper closely enough that they invited you to a talk, that probably means they were far enough along toward independently inventing it

That's more than a stretch. They likely invited him because someone thought the abstract sounded interesting, or something like that.

busfahrer 5 hours ago | parent | prev | next [-]

I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?

yorwba 4 hours ago | parent [-]

Multi-head Latent Attention is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization (VQ) can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.
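
A toy contrast for the retrofit side. A simple per-coordinate 4-bit uniform quantizer stands in for the real VQ scheme here (illustrative only, not what the paper does):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 64, 128
    K = rng.standard_normal((n, d))  # cached key vectors
    q = rng.standard_normal(d)       # incoming query

    def quantize_4bit(v):
        # Uniform per-vector quantizer with 16 signed levels.
        scale = np.abs(v).max() / 7.5
        return np.clip(np.round(v / scale), -8, 7) * scale

    K_hat = np.stack([quantize_4bit(k) for k in K])

    def attn_weights(q, K):
        s = K @ q / np.sqrt(d)
        e = np.exp(s - s.max())
        return e / e.sum()

    a, a_hat = attn_weights(q, K), attn_weights(q, K_hat)
    print("max attention-weight change:", np.abs(a - a_hat).max())

Nothing here retrains anything: the model weights never change, which is why this kind of compression can be bolted onto an existing checkpoint, while MLA changes the attention computation itself.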

sva_ 3 hours ago | parent | prev [-]

Schmidhuber'd