jychang 3 hours ago
In practice, tps is a reflection of VRAM memory bandwidth during inference, so tps tells you a lot about the hardware you're running on. Comparing tps ratios (saying a model is roughly 2x faster or slower than another model) can tell you a lot about the active param count. I won't say it tells you everything; I have no clue what optimizations Opus may have, which could range from native FP4 experts to spec decoding with MTP to whatever. But considering Chinese models like DeepSeek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the Chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the Chinese models at most.
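The bandwidth-bound claim can be sketched as a roofline-style estimate: at batch size 1, each decoded token requires streaming all active weights from memory, so tps ≈ bandwidth / (active params × bytes per param). The numbers below are illustrative assumptions, not measured figures for any of the models named above.

```python
def est_tps(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Back-of-envelope decode throughput: tokens/s = bandwidth / weight bytes per token."""
    weight_bytes_gb = active_params_b * bytes_per_param  # GB read from VRAM per token
    return bandwidth_gb_s / weight_bytes_gb

# Hypothetical MoE model: 37B active params, FP8 weights, 3350 GB/s HBM
print(round(est_tps(3350, 37, 1.0)))  # ~91 tokens/s

# Same model quantized to int4 (0.5 bytes/param): roughly 2x faster
print(round(est_tps(3350, 37, 0.5)))  # ~181 tokens/s
```

This is why native int4 (or FP4) weights and MTP-style speculative decoding muddy the comparison: they change tps without changing active param count.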
fc417fc802 2 hours ago | parent
> In practice, tps is a reflection of vram memory bandwidth during inference.

> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.

You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training, obviously). You just end up with an increasingly deep pipeline. So time to first token increases, but aggregate tps also increases as you add additional hardware.
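The pipeline-parallel point can be made concrete with a toy timing model (my own simplification; it ignores communication overhead and assumes layers split evenly across stages): sharding raises aggregate tps across concurrent requests, but a single autoregressive stream still pushes each token through every stage in sequence, so its decode rate stays bandwidth-bound per stage.

```python
def pipeline_stats(total_layer_time_s: float, stages: int):
    """Toy model of layer-wise pipeline parallelism at inference time.

    total_layer_time_s: time for one token to traverse all layers on one device.
    Assumes an even split across stages and zero inter-stage transfer cost.
    """
    stage_time = total_layer_time_s / stages
    ttft = total_layer_time_s            # first token still crosses every stage
    aggregate_tps = 1.0 / stage_time     # once full, one token finishes per stage-time
    single_stream_tps = 1.0 / total_layer_time_s  # one request's tokens stay serial
    return ttft, aggregate_tps, single_stream_tps

ttft, agg, single = pipeline_stats(total_layer_time_s=0.05, stages=4)
print(ttft, agg, single)  # 0.05 80.0 20.0
```

Under these assumptions, 4-way sharding gives 4x the aggregate tps (80 vs 20 tok/s) while a single stream sees no speedup, which is why observed single-stream tps still tracks active param count even on sharded deployments.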