cmrdporcupine 5 hours ago
I'm getting similar numbers on the NVIDIA Spark: around 25-30 tokens/sec output and 251 tokens/sec prompt processing, but I'm running the Q4_K_XL quant. I'll try Q8 next, though that would leave less room for context. I tried FP8 in vLLM and it used 110GB, and then my machine started to swap when I hit it with a query. Only room for 16k context. I suspect there will be some optimizations over the next few weeks that pick up the performance on these kinds of machines.

I have it writing some Rust code and it's definitely slower than using a hosted model, but it actually seems pretty competent. These are the first results from a locally hosted model that I could see myself actually using, though only once the speed picks up a bit. I suspect the API providers will offer this model nice and cheap, too.
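For anyone wanting to reproduce a setup like this, here is a minimal sketch using llama-cpp-python; the GGUF filename, 16k context, and client library are assumptions chosen to mirror the quant/context tradeoff described above, not what the poster actually ran:

```python
# Sketch: load a Q4_K_XL-style GGUF locally and cap the context at 16k,
# trading context window for the memory the quantized weights need.
from llama_cpp import Llama

llm = Llama(
    model_path="model-UD-Q4_K_XL.gguf",  # hypothetical path to the quantized weights
    n_ctx=16384,        # smaller context leaves more unified memory for the weights
    n_gpu_layers=-1,    # offload all layers to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what this Rust function does: fn add(a: i32, b: i32) -> i32 { a + b }"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```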
aseipp 4 hours ago | parent
llama.cpp is giving me ~35 tok/sec with the unsloth quants (UD-Q4_K_XL, mentioned elsewhere in this thread) on my Spark. FWIW, my understanding and experience is that llama.cpp gives slightly better performance for "single user" workloads, but I'm not sure why. I'm asking it to do some analysis and explain some Rust code in a rather large open source project, and it's working nicely. I agree this is a model I could possibly, maybe use locally...