suprjami 12 hours ago

Some models suffer badly from KV cache quantisation, and you can also take a speed hit when using dissimilar K and V types.
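A rough sketch of why the quality loss is so model-dependent (this is an illustration, not any particular runtime's kernel): with simple per-row absmax quantisation, a single outlier channel inflates the scale for its whole row and crushes the effective precision of every other value in it. Models with heavy activation outliers therefore degrade far more than models without them.

```python
import numpy as np

# Toy per-row absmax 8-bit quantisation of a key tensor.
# Row 0 gets an artificial outlier channel to mimic the outlier-heavy
# activations seen in some models; the other rows are well-behaved.
rng = np.random.default_rng(1)
k = rng.standard_normal((4, 128)).astype(np.float32)
k[0, 0] = 50.0  # hypothetical outlier channel

scale = np.abs(k).max(axis=1, keepdims=True) / 127.0  # one scale per row
q = np.round(k / scale).astype(np.int8)               # quantise
k_hat = q.astype(np.float32) * scale                  # dequantise

err = np.abs(k_hat - k).max(axis=1)  # worst-case round-trip error per row
print(err)  # the outlier row's error dwarfs the others
```

The outlier row's rounding error is an order of magnitude larger than the clean rows', which is the basic mechanism behind "this model quantises fine, that one falls apart".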

TurboQuant seems to be the next big thing for context memory usage: its polar-coordinate encoding achieves a ~5x reduction in memory with minimal to no quality loss, and even a slight speedup in some cases.
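To give a feel for the polar-coordinate idea (a hedged sketch only, not TurboQuant's actual algorithm): pairs of cache values can be stored as a magnitude plus a coarsely quantised angle, since directions tolerate coarse quantisation better than raw coordinates do.

```python
import numpy as np

# Hypothetical illustration: store 2D value pairs as (magnitude, angle),
# keeping the magnitude at full precision and quantising the angle to
# 6 bits. Not TurboQuant's method -- just the polar-coordinate concept.
rng = np.random.default_rng(0)
kv = rng.standard_normal((64, 2))  # toy cache entries as (x, y) pairs

r = np.hypot(kv[:, 0], kv[:, 1])       # magnitudes (kept in float)
theta = np.arctan2(kv[:, 1], kv[:, 0]) # angles in (-pi, pi]

levels = 64  # 6-bit angle codebook
q = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.uint8)
theta_hat = q.astype(np.float64) / (levels - 1) * 2 * np.pi - np.pi

# Reconstruct the original pairs from (r, quantised angle)
recon = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
err = np.abs(recon - kv).max()
print(f"max reconstruction error: {err:.4f}")
```

Even with only 6 bits for the angle, the worst-case reconstruction error stays small because the error scales with the angular step rather than the coordinate range.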

LuxBennu 34 minutes ago | parent | next [-]

yeah fair point, it's definitely model dependent. i've had good results with qwen but tried it on a smaller mistral variant once and the output quality dropped noticeably even at q8 for both. the speed hit from mixed types hasn't been bad on apple silicon in my experience but i can see it mattering more on cuda.
