mysteria 6 days ago
Hold on, you're only getting 45 tokens/sec with Mistral 7B on a 5090 of all things? That card gets ~240 tokens/sec with Llama 7B quantized to 4 bits on llama.cpp [1], and those models should be pretty similar architecturally. I don't know exactly how the scaling works here, but considering that LLM inference is memory-bandwidth limited, you should get well beyond 100 tokens/sec with the same model at an 8-bit quantization.
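Quick back-of-the-envelope in Python (a rough sketch; the ~1.8 TB/s bandwidth figure for the 5090 and the assumption that decode reads every weight exactly once per token are my assumptions, not measurements):

    # Roofline-style ceiling: tokens/sec <= bandwidth / weight footprint,
    # assuming the decode step streams every weight once per generated token.
    bandwidth_gb_s = 1800          # assumed RTX 5090 memory bandwidth, GB/s
    params_billion = 7             # 7B-class model (Mistral 7B / Llama 7B)

    for bits in (16, 8, 4):
        weights_gb = params_billion * bits / 8        # weight footprint in GB
        ceiling = bandwidth_gb_s / weights_gb         # bandwidth-bound upper limit
        print(f"{bits}-bit: ~{weights_gb:.1f} GB weights -> <= {ceiling:.0f} tok/s")

That puts the ceiling around 500 tok/s at 4-bit and 250 tok/s at 8-bit, which lines up with the ~240 tok/s people see at 4-bit and is why 45 tok/s looks way off.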
Sohcahtoa82 6 days ago
My understanding is that quantizing lowers memory usage but increases compute usage, because the weights still need to be converted to fp16 on the fly at inference time. Clearly I'm doing something wrong if it's a net loss in performance for me. I might have to look more into this.
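Roughly the mechanism I mean, as a toy NumPy sketch (illustrative only, not llama.cpp's actual kernel): the weights stay int8 in memory plus a per-row scale, and get rescaled to float right before the matmul, so there's a bit of extra compute per weight but far fewer bytes read.

    import numpy as np

    # Toy "dequantize on the fly" example. Real kernels fuse this into the GEMM;
    # this only shows where the extra compute and the bandwidth saving come from.

    def quantize_rows(w):
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        return np.round(w / scale).astype(np.int8), scale

    def matmul_dequant(x, q, scale):
        # Extra compute: one multiply per weight to rescale.
        # Saved bandwidth: 1 byte read per weight instead of 2 (fp16) or 4 (fp32).
        return x @ (q.astype(np.float32) * scale).T

    rng = np.random.default_rng(0)
    w = rng.standard_normal((1024, 1024)).astype(np.float32)
    x = rng.standard_normal((1, 1024)).astype(np.float32)

    q, s = quantize_rows(w)
    ref = x @ w.T
    err = np.abs(matmul_dequant(x, q, s) - ref).max() / np.abs(ref).max()
    print(f"relative error from int8 quantization: {err:.4f}")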