mungoman2 5 hours ago
This looks very interesting: it seems possible to get those rates without exotic hardware. But I have to say the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison. We need to see a comparison using this framework with useful models, which at present seems to mean ~30B.
gaeld 4 hours ago
Great points. We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly on the card.

Our tech preview is about speed (hence the small dense model, which was easier to implement). The math checks out, though, for supporting large frontier MoE models at similar speeds:

- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say roughly 3x bigger than 4 GB. In theory we could reach >1,000 tok/s on it with MI300X/H200, and up to 4k on next-generation GPUs.

Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
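The size comparison above can be reproduced with a few lines of arithmetic. This is only a rough sketch, not the blog post's calculation: it counts weight bytes alone (no KV cache or activations), takes 1 GB as 1e9 bytes, and reads the 13B figure as the parameters touched per token. The helper names `weight_gb` and `scaled_rate` are made up for illustration, and `scaled_rate` assumes decoding is purely memory-bandwidth-bound, which is a simplification.

```python
# Back-of-envelope weight footprints for the models named in the comment.
# Assumptions: 1 GB = 1e9 bytes; only weights are counted.

def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight footprint in GB for a parameter count and precision."""
    return params_billions * bits_per_param / 8.0


def scaled_rate(baseline_tok_s: float, baseline_gb: float, target_gb: float) -> float:
    """Hypothetical helper: scale a decode rate inversely with bytes read per token.

    Assumes decoding is purely memory-bandwidth-bound.
    """
    return baseline_tok_s * baseline_gb / target_gb


dense_2b_fp16 = weight_gb(2.0, 16)       # 2B dense model in FP16
gpt_oss_active_fp8 = weight_gb(5.1, 8)   # GPT-OSS-120B active params in FP8

# DeepSeek V4 Flash is quoted as 13B in mixed FP4/FP8. The exact mix is not
# given, so bracket it between all-FP4 and all-FP8.
flash_low = weight_gb(13.0, 4)
flash_high = weight_gb(13.0, 8)

print(f"2B dense, FP16:        {dense_2b_fp16:.1f} GB")
print(f"GPT-OSS-120B, FP8:     {gpt_oss_active_fp8:.1f} GB")
print(f"DeepSeek V4 Flash:     {flash_low:.1f} to {flash_high:.1f} GB")
print(f"Ratio to the 2B model: {flash_low / dense_2b_fp16:.2f}x to "
      f"{flash_high / dense_2b_fp16:.2f}x")
```

Under these assumptions the "roughly 3x" estimate sits near the all-FP8 end of the bracket. `scaled_rate(baseline, 4.0, 12.0)` returns one third of whatever baseline rate is passed in; the baseline itself is not stated in this thread, so it is left as a parameter.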
kirtivr 4 hours ago
They got 1K tok/s with DeepSeek V4 Pro. That's kinda cool.
hirako2000 3 hours ago
Fallacies look interesting? As if we aren't getting dubious claims every day?
cyanydeez 4 hours ago
Likely the small model lets whatever fuzzer they designed to poke the GPUs find optimizations much faster. They seem to think it scales up because they're shortening the stack.