linuxhansl 3 hours ago

I am fascinated by this and similar research (RotorQuant, etc.). It seems like by next year we will be able to run this year's largest models on last year's hardware. :)

Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.

qingcharles 19 minutes ago | parent | next [-]

We're only a few years into this new tech getting serious research man-hours thrown at it, and already some incredible optimizations have been found in a short amount of time. Not only has the efficiency of inference increased dramatically, but the quality of tiny models has also improved significantly.

The future is bright for local AI.

everythingctl 2 hours ago | parent | prev [-]

> Maybe we can run more powerful models locally.

I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. They don't let you fit a larger model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler's data center.
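A rough sketch of that point, with made-up numbers (the memory budget and per-sequence KV size below are placeholders, not from any real deployment): for a fixed amount of memory left over after the weights are loaded, the number of sequences you can serve concurrently scales with how small each sequence's KV cache is.

    # Rough sketch with assumed numbers: concurrent sequences that fit in a
    # fixed KV-cache budget at different KV precisions.
    kv_budget_gib = 40        # memory left after loading weights (assumed)
    kv_per_seq_fp16_gib = 4   # per-sequence KV cache at fp16 (assumed)

    for label, factor in [("fp16", 1), ("int4 (~4x)", 4), ("~6x quantized", 6)]:
        per_seq = kv_per_seq_fp16_gib / factor
        print(f"{label:14s} -> {int(kv_budget_gib // per_seq)} concurrent sequences")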

linuxhansl an hour ago | parent [-]

The size of the KV cache (the stored context) is proportional to the number of layers, the hidden dimension, and the context length. For a 400B-class model it could be 30-60 GB for just an 8K context window (it depends on the model; that's just a ballpark).

So shrinking that by 6x (from fp16) would be a big win for larger models. True, while TurboQuant can also be applied to model weights, it won't save size over q4 compression, but it should have better accuracy.
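A back-of-envelope sketch of where that ballpark comes from (the layer count and hidden size are assumed placeholder values, not any specific 400B architecture, and it ignores grouped-query attention, which would shrink it further):

    # Back-of-envelope KV cache size: per token, each layer stores one K and
    # one V vector of size hidden_dim.
    def kv_cache_bytes(layers, hidden_dim, context_len, bytes_per_elem):
        return 2 * layers * hidden_dim * context_len * bytes_per_elem

    layers, hidden_dim, context = 120, 16384, 8192   # assumed values
    fp16 = kv_cache_bytes(layers, hidden_dim, context, 2)
    print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")   # ~60 GiB
    print(f"6x smaller:    {fp16 / 6 / 2**30:.1f} GiB")  # ~10 GiB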

Edits: Better context