Sohcahtoa82 6 days ago
FWIW, my slowness is because of quantizing. I've been using Mistral 7B, and I can get 45 tokens/sec, which is PLENTY fast, but to save VRAM so I can game while doing inference (I run an IRC bot that lets people talk to Mistral), I quantize to 8 bits, which brings my inference speed down to ~8 tokens/sec.

For gaming, I absolutely love this card. I can play Cyberpunk 2077 with all the graphics settings maxed out and get 120+ fps, though with a game that graphically intense I do need to kill the bot to free up the VRAM. With something simpler like League of Legends, I can have inference happening while I play with zero impact on game performance.

I also have 128 GB of system RAM. I've thought about loading the model in both 8-bit and 16-bit into system RAM and just swapping which one is in VRAM depending on whether I'm playing a game, so that when I'm not playing anything, the bot runs significantly faster.
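For concreteness, the 8-bit path looks roughly like this (just a minimal sketch assuming the Hugging Face transformers + bitsandbytes stack; the checkpoint name is a placeholder, not necessarily what the bot actually runs):

  # Sketch: load Mistral 7B with 8-bit (LLM.int8()) weights via bitsandbytes.
  # Assumption: the bot uses transformers + bitsandbytes; other stacks differ.
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # 8-bit weights roughly halve the footprint (~7 GB vs ~14 GB in fp16),
  # but the mixed int8/fp16 matmuls are slower than plain fp16 kernels,
  # which would explain a drop from ~45 to ~8 tokens/sec.
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=BitsAndBytesConfig(load_in_8bit=True),
      device_map="auto",
  )

  inputs = tokenizer("Hello from IRC!", return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(out[0], skip_special_tokens=True))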
mysteria 6 days ago
Hold on, you're only getting 45 tokens/sec with Mistral 7B on a 5090 of all things? That card gets ~240 tokens/sec with Llama 7B quantized to 4 bits on llama.cpp [1], and those models should be pretty similar architecturally. I don't know exactly how the scaling works here, but given that LLM inference is memory-bandwidth-limited, you should be getting well over 100 tokens/sec with the same model and an 8-bit quantization.
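The back-of-the-envelope version of that, assuming the 5090's roughly 1.8 TB/s memory bandwidth and that each generated token streams the full weight set once (ignoring KV cache and kernel overhead):

  # Rough bandwidth-bound ceiling for a 7B model on ~1.8 TB/s of VRAM bandwidth.
  # Assumption: one full pass over the weights per generated token.
  params = 7e9
  bandwidth = 1.8e12  # bytes/sec, approximate 5090 spec

  for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
      weight_bytes = params * bytes_per_param
      print(f"{name}: ~{bandwidth / weight_bytes:.0f} tokens/sec ceiling")
  # -> roughly 129 / 257 / 514 tokens/sec respectively, so ~8 tokens/sec
  #    points at slow 8-bit kernels rather than a bandwidth limit.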