mysteria 6 days ago

Yes, it increases compute usage, but your 5090 has a hell of a lot of compute and the decompression algorithms are pretty simple. Memory bandwidth is the bottleneck here, and unless you have a strange GPU with lots of fast memory but very weak compute, a quantized model should always run faster.
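As a rough back-of-envelope sketch (the bandwidth and bytes-per-weight figures below are approximations I'm assuming, not measurements): token generation is basically memory-bound, since every generated token has to stream roughly all of the weights from VRAM once, so the ceiling on tokens/s is about memory bandwidth divided by model size:

    # Illustrative upper bound for a memory-bandwidth-bound decode.
    # All numbers are assumptions: ~1790 GB/s for a 5090, an 8B-parameter
    # model, and approximate bytes-per-weight for each quantization.
    bandwidth_gb_s = 1790
    params_b = 8

    for name, bytes_per_weight in [("fp16", 2.0), ("q8_0", 1.06), ("q4_k_m", 0.58)]:
        model_gb = params_b * bytes_per_weight
        print(f"{name}: ~{model_gb:.1f} GB weights, <= ~{bandwidth_gb_s / model_gb:.0f} tok/s")

With those assumed numbers the 4-bit model tops out around 3x faster than fp16, which is the whole point: less data to move per token.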

If you're using llama.cpp, run the benchmark in the link I posted earlier and see what you get; I think there's something like it for vLLM as well.
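For reference, the benchmark that ships with llama.cpp is the llama-bench tool; an invocation along these lines (the model path is just a placeholder) reports prompt-processing and generation tokens/s:

    llama-bench -m ./models/your-model-q4_k_m.gguf -p 512 -n 128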