cmrdporcupine 5 hours ago
I'm getting similar numbers on the NVIDIA Spark: around 25-30 tokens/sec output and 251 tokens/sec prompt processing, but I'm running the Q4_K_XL quant. I'll try Q8 next, though that would leave less room for context. I tried FP8 in vLLM and it used 110GB, and then my machine started to swap when I hit it with a query. Only room for 16k context. I suspect there will be some optimizations over the next few weeks that pick up the performance on these kinds of machines.

I have it writing some Rust code and it's definitely slower than using a hosted model, but it actually seems pretty competent. These are the first results from a locally hosted model that I could see myself actually using, though only once the speed picks up a bit. I suspect the API providers will offer this model nice and cheap, too.
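For anyone wanting to reproduce a setup like this, here is a minimal sketch using llama-cpp-python; the GGUF filename, 16k context, and client library are assumptions chosen to mirror the quant/context tradeoff described above, not what the poster actually ran:

```python
# Sketch: load a Q4_K_XL-style GGUF locally and cap the context at 16k,
# trading context window for the memory the quantized weights need.
from llama_cpp import Llama

llm = Llama(
    model_path="model-UD-Q4_K_XL.gguf",  # hypothetical path to the quantized weights
    n_ctx=16384,        # smaller context leaves more unified memory for the weights
    n_gpu_layers=-1,    # offload all layers to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what this Rust function does: fn add(a: i32, b: i32) -> i32 { a + b }"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```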
aseipp 4 hours ago | parent
llama.cpp is giving me ~35 tok/sec with the unsloth quants (UD-Q4_K_XL, mentioned elsewhere in this thread) on my Spark. FWIW, my understanding and experience is that llama.cpp gives slightly better performance for "single user" workloads, but I'm not sure why. I'm asking it to do some analysis and explain some Rust code in a rather large open source project, and it's working nicely. I agree this is a model I could possibly, maybe use locally...