mysteria | 6 days ago
Yes, it increases compute usage, but your 5090 has a hell of a lot of compute and the decompression algorithms are pretty simple. Memory bandwidth is the bottleneck here, and unless you have a strange GPU with lots of fast memory but very weak compute, a quantized model should always run faster. If you're using llama.cpp, run the benchmark in the link I posted earlier and see what you get; I think there's something similar for vLLM as well.
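To make the memory-bound argument concrete, here's a rough back-of-envelope sketch in Python. The numbers are my assumptions, not from the thread: ~1.8 TB/s of memory bandwidth for a 5090-class card, a 7B-parameter model, and approximate bytes-per-weight for each quant format. During single-stream decoding every weight is read roughly once per token, so the token-rate ceiling is about bandwidth divided by model size:

    # Rough ceiling on decode speed when weight reads dominate.
    # Assumptions (illustrative, not measured): ~1.8 TB/s bandwidth,
    # 7B params, approximate bytes-per-weight for each quant format.
    GPU_BANDWIDTH_B_PER_S = 1.8e12
    PARAMS = 7e9

    def tokens_per_s_ceiling(bytes_per_param):
        # Each token reads the full weight set once from VRAM.
        model_bytes = PARAMS * bytes_per_param
        return GPU_BANDWIDTH_B_PER_S / model_bytes

    for name, bpp in [("FP16", 2.0), ("Q8_0", 1.07), ("Q4_K_M", 0.61)]:
        print(f"{name}: ~{tokens_per_s_ceiling(bpp):.0f} tok/s ceiling")

The ceiling scales inversely with bytes per weight, which is why the quantized model wins unless compute becomes the limit first. Real throughput will come in lower (KV cache reads, kernel overhead), which is exactly what actually benchmarking it with llama.cpp tells you.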