Sohcahtoa82 6 days ago
My understanding is that quantization lowers memory usage but increases compute usage, because the weights still have to be converted to fp16 on the fly at inference time. Clearly I'm doing something wrong if it's a net loss in performance for me. I might have to look into this more.
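A rough back-of-envelope check (a sketch with approximate, assumed numbers: ~1.79 TB/s bandwidth for a 5090, an 8B-parameter model, and the assumption that every weight is streamed once per generated token) shows why the memory savings usually dominate the extra dequant work:

    # Roofline-style estimate: single-stream token generation is typically
    # limited by how fast the GPU can stream the weights from VRAM,
    # not by the dequantization math. Numbers are illustrative.

    bandwidth_gbps = 1792  # RTX 5090 memory bandwidth, ~1.79 TB/s
    params_b = 8           # assumed 8B-parameter model

    # Approximate bytes per parameter for each format
    # (Q8_0 stores 8-bit weights plus one scale per 32-weight block).
    for name, bytes_per_param in [("fp16", 2.0), ("Q8_0", 1.0625), ("Q4_K_M", 0.56)]:
        weight_gb = params_b * bytes_per_param
        # Upper bound: all weights read once per token.
        tok_s = bandwidth_gbps / weight_gb
        print(f"{name:8s} ~{weight_gb:5.1f} GB of weights -> ceiling of ~{tok_s:.0f} tok/s")

The smaller the weights, the higher the bandwidth-imposed ceiling, so a quantized model has more headroom even after paying for dequantization.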
mysteria 6 days ago | parent
Yes, it increases compute usage, but your 5090 has a hell of a lot of compute and the dequantization algorithms are pretty simple. Memory bandwidth is the bottleneck here, and unless you have a strange GPU with lots of fast memory but very weak compute, a quantized model should always run faster. If you're using llama.cpp, run the benchmark in the link I posted earlier and see what you get; I think there's something like it for vLLM as well.
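To give a sense of how simple the "decompression" is, here's a minimal sketch of Q8_0-style blockwise dequantization (a simplified illustration of the scheme, not llama.cpp's actual kernel): each block of 32 weights shares one scale, so dequantizing is just one multiply per weight, which is trivial next to the matmul that consumes the result.

    import numpy as np

    BLOCK = 32  # Q8_0 groups weights into blocks of 32, one fp16 scale each

    def quantize_q8_0(w: np.ndarray):
        """Quantize fp32 weights to int8 plus one fp16 scale per block."""
        blocks = w.reshape(-1, BLOCK)
        scales = (np.abs(blocks).max(axis=1) / 127.0).astype(np.float16)
        q = np.round(blocks / scales[:, None].astype(np.float32)).astype(np.int8)
        return q, scales

    def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        # One multiply per weight: int8 value times its block's scale.
        return (q.astype(np.float32) * scales[:, None].astype(np.float32)).ravel()

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_q8_0(w)
    print("max abs error:", np.abs(dequantize_q8_0(q, s) - w).max())

In practice the dequant is fused into the matmul kernel, so the int8 weights never round-trip through memory as fp16. That's the whole trick: you move fewer bytes and pay a near-free multiply per weight.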