Sohcahtoa82 6 days ago
FWIW, my slowness is because of quantizing. I've been using Mistral 7B, and I can get 45 tokens/sec, which is PLENTY fast, but to save VRAM so I can game while doing inference (I run an IRC bot that lets people talk to Mistral), I quantize to 8 bits, which brings my inference speed down to ~8 tokens/sec.

For gaming, I absolutely love this card. I can play Cyberpunk 2077 with all the graphics settings maxed out and get 120+ fps, though with a game that graphically intense I do need to kill the bot to free up the VRAM. With something simpler like League of Legends, I can have inference happening while I play with zero impact on game performance.

I also have 128 GB of system RAM. I've thought about loading the model in both 8-bit and 16-bit into system RAM and just swapping which one is in VRAM depending on whether I'm playing a game, so that when I'm not playing anything, the bot runs significantly faster.
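For concreteness, the 8-bit path looks roughly like this (just a minimal sketch assuming the Hugging Face transformers + bitsandbytes stack; the checkpoint name is a placeholder, not necessarily what the bot actually runs):

  # Sketch: load Mistral 7B with 8-bit (LLM.int8()) weights via bitsandbytes.
  # Assumption: the bot uses transformers + bitsandbytes; other stacks differ.
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # 8-bit weights roughly halve the footprint (~7 GB vs ~14 GB in fp16),
  # but the mixed int8/fp16 matmuls are slower than plain fp16 kernels,
  # which would explain a drop from ~45 to ~8 tokens/sec.
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=BitsAndBytesConfig(load_in_8bit=True),
      device_map="auto",
  )

  inputs = tokenizer("Hello from IRC!", return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(out[0], skip_special_tokens=True))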
mysteria 6 days ago
Hold on, you're only getting 45 tokens/sec with Mistral 7B on a 5090 of all things? That card gets ~240 tokens/sec with Llama 7B quantized to 4 bits on llama.cpp [1], and those models should be pretty similar architecturally. I don't know exactly how the scaling works here, but given that LLM inference is memory-bandwidth-limited, you should be getting well over 100 tokens/sec with the same model and an 8-bit quantization.
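The back-of-the-envelope version of that, assuming the 5090's roughly 1.8 TB/s memory bandwidth and that each generated token streams the full weight set once (ignoring KV cache and kernel overhead):

  # Rough bandwidth-bound ceiling for a 7B model on ~1.8 TB/s of VRAM bandwidth.
  # Assumption: one full pass over the weights per generated token.
  params = 7e9
  bandwidth = 1.8e12  # bytes/sec, approximate 5090 spec

  for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
      weight_bytes = params * bytes_per_param
      print(f"{name}: ~{bandwidth / weight_bytes:.0f} tokens/sec ceiling")
  # -> roughly 129 / 257 / 514 tokens/sec respectively, so ~8 tokens/sec
  #    points at slow 8-bit kernels rather than a bandwidth limit.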