TurboQuant: A First-Principles Walkthrough (arkaung.github.io)
34 points by kweezar 2 hours ago | 4 comments
amitport 21 minutes ago
TurboQuant IS restricted EDEN quantization (NeurIPS 21, ICML 22). It is missing the optimal scale derivations, which makes the TurboQuant variant considerably LESS accurate than those works. We show this thoroughly in a new note at https://arxiv.org/abs/2604.18555

We were the first to introduce post-rotation distribution-aware quantization in 21, which was LATER adopted in many fields including federated learning, vector retrieval, databases, inference engines, and KV-cache. It would be nice to get some credit for this. And it is certainly baffling to see the name "TurboQuant" repeated in this context, considering the many works from 21 onwards.

The blog post above basically walks you through EDEN quantization, but then ends up settling for a less-than-optimal MSE-minimizing version and an unbiasing trick that often costs a full bit more than DRIVE/EDEN need for the same results (with the unbiasing scale shown in the original 21 paper).
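To make the bias-versus-MSE trade-off concrete, here is a toy sketch (my own illustration, not code from EDEN/DRIVE, the note, or the blog post; the function names and the plain min-max scale are assumptions, and the rotation and optimal-scale machinery those papers derive is omitted entirely): a deterministic round-to-nearest quantizer minimizes per-draw error but is biased, while stochastic rounding is unbiased, so averaging many draws recovers the input.

```python
# Toy sketch only: contrasts deterministic (low-MSE, biased) uniform
# quantization with unbiased stochastic rounding at the same bit width.
# All names here are hypothetical; real EDEN/DRIVE-style schemes add
# random rotations and derived optimal scales that are not modeled.
import numpy as np

rng = np.random.default_rng(0)

def quantize_det(x, bits):
    """Round-to-nearest uniform quantizer: low MSE, but E[q(x)] != x."""
    levels = 2 ** bits - 1
    scale = np.max(np.abs(x))               # naive min-max scale (assumption)
    q = np.round((x / scale + 1) / 2 * levels)
    return (q / levels * 2 - 1) * scale

def quantize_unbiased(x, bits):
    """Stochastic rounding: E[q(x)] = x, at the cost of higher per-draw MSE."""
    levels = 2 ** bits - 1
    scale = np.max(np.abs(x))
    t = (x / scale + 1) / 2 * levels
    lo = np.floor(t)
    # Round up with probability equal to the fractional part.
    q = lo + (rng.random(x.shape) < (t - lo))
    return (q / levels * 2 - 1) * scale

x = rng.standard_normal(10_000)
det_err = np.abs(quantize_det(x, 3) - x).mean()
# Averaging many stochastic draws converges to x (unbiasedness),
# even though each individual draw is noisier than the deterministic one.
avg = np.mean([quantize_unbiased(x, 3) for _ in range(200)], axis=0)
sto_err = np.abs(avg - x).mean()
print(f"deterministic per-draw error: {det_err:.3f}")
print(f"error of 200-draw stochastic average: {sto_err:.3f}")
```

The parent's point, as I read it, is that the scale matters: with a properly derived unbiasing scale the stochastic scheme need not give up that extra bit, whereas a naive min-max scale like the one above does.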
jarbus 4 minutes ago
This is incredible. Interactive demos like this make mathematics 10x more accessible.
linuxhansl an hour ago
I am fascinated by this and similar research (RotorQuant, etc.). It seems that by next year we will be able to run this year's largest models on last year's hardware. :) Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
| ||||||||