| ▲ | LoganDark 5 hours ago | |||||||
I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds. | ||||||||
| ▲ | gaeld 4 hours ago | parent [-] | |||||||
Fair point - this tech preview is about speed (hence the small dense model, which was easier to implement). The math still checks out for supporting large frontier MoE models at similar speeds, though. At batch size 1, GPT-OSS-120B has 5.1B active parameters; in FP8, that puts it in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB). DeepSeek V4 Flash has 13B active in mixed FP4/FP8. Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus... | ||||||||
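For anyone who wants to check the size comparison, here is a rough sketch of the arithmetic in Python. It assumes the weights read per decoded token at batch size 1 are simply active parameters times bytes per parameter, with 1 GB = 1e9 bytes, and it ignores KV cache, activations, and quantization scale overhead. The function name `active_weight_gb` and the FP4/FP8 bounds are illustrative assumptions, not figures from the blog post; the exact FP4/FP8 split for the 13B case is not given above, so only the all-FP4 and all-FP8 extremes are computed.

```python
# Bytes per parameter for each numeric format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def active_weight_gb(active_params_billions: float, dtype: str) -> float:
    """Approximate GB of weights read per decoded token at batch size 1.

    Billions of parameters times bytes per parameter gives GB (1e9 bytes).
    Hypothetical helper: ignores KV cache, activations, and scale factors.
    """
    return active_params_billions * BYTES_PER_PARAM[dtype]


# Figures quoted in the comment above.
dense_2b_fp16 = active_weight_gb(2.0, "fp16")   # 2B dense model in FP16
gpt_oss_fp8 = active_weight_gb(5.1, "fp8")      # 5.1B active parameters in FP8

# The 13B mixed FP4/FP8 case: the mix is unknown here, so bracket it
# between the all-FP4 and all-FP8 extremes (an assumption, not a measurement).
mixed_13b_low = active_weight_gb(13.0, "fp4")
mixed_13b_high = active_weight_gb(13.0, "fp8")

print(f"2B dense, FP16:      {dense_2b_fp16:.1f} GB")
print(f"5.1B active, FP8:    {gpt_oss_fp8:.1f} GB")
print(f"13B active, FP4-FP8: {mixed_13b_low:.1f} to {mixed_13b_high:.1f} GB")
```

Under these assumptions the first two land at 4 GB and 5.1 GB, matching the numbers in the comment, while the 13B case falls somewhere between 6.5 GB and 13 GB depending on the actual precision mix.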