ismailmaj 4 hours ago
I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.
gaeld 4 hours ago
Great points, let me clarify:
- Model size: 2B is just for this preview (it was faster to implement). Our article explains how we expect to support large frontier MoE models at 1,000 to 5,000 tokens/s.
- Reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU is possible with existing inference engines like vLLM. But there is a ceiling. The hard part comes when you try to go faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.). All our work at Kog is about removing these bottlenecks.
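To make the "glass ceiling" concrete, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic: each decoded token gets a time budget of 1 s divided by the target rate, and any fixed per-token overhead caps throughput no matter how fast the compute gets. The function names and the 600 µs / 400 µs figures are hypothetical, chosen for illustration, not measurements from Kog or vLLM.

```python
def token_budget_us(tokens_per_s: float) -> float:
    """Time available per decoded token, in microseconds, for one sequence."""
    return 1_000_000 / tokens_per_s


def max_tokens_per_s(compute_us: float, overhead_us: float) -> float:
    """Throughput when each token costs compute time plus a fixed overhead.

    overhead_us stands in for launches, syncs, inter-GPU comms and CPU
    sampling (hypothetical lump sum; it does not shrink with more GPUs).
    """
    return 1_000_000 / (compute_us + overhead_us)


if __name__ == "__main__":
    # Budget per token at the rates mentioned in the thread.
    for target in (500, 1_000, 5_000):
        print(f"{target} tok/s -> {token_budget_us(target):.0f} us per token")

    # Hypothetical numbers: 400 us of fixed overhead per token.
    overhead = 400.0
    for compute in (600.0, 300.0, 0.0):  # faster hardware shrinks compute only
        rate = max_tokens_per_s(compute, overhead)
        print(f"compute={compute:.0f} us -> {rate:.0f} tok/s")
```

Under these assumed numbers, even with compute time driven to zero the rate cannot exceed 1,000,000 / 400 = 2,500 tok/s, which is why shaving microseconds of overhead matters more than adding GPUs at that point.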