Gracana 10 hours ago
I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with small context (16k). I need to do more tuning and I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.
segmondy 9 hours ago
You can add 1 more GPU so you can take advantage of tensor parallelism. I get the same speed with 5 3090s with most of the model on 2400 MHz DDR4 RAM, 8.5 tok/s almost constant. I don't really do agents, just chat, and it holds up to 64k.
esafak 9 hours ago
The pitiful state of GPUs. $10K for a sloth with no memory.