| ▲ | nyrikki a day ago | |||||||
We still have the problem that autoregressive decoders are memory-bound. The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared to roughly 250 TPS/user (out of 10k+ TPS on the server). Is there something I missed? This looks more like a 14.4 to 56 kbps modem story on a 64 kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to come from using existing constraints more efficiently, not from Fios arriving. I don't have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60–80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where to look to figure out what I am missing. | ||||||||
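To put rough numbers on the draft-model estimate above, here is a back-of-envelope sketch using the standard speculative decoding speedup model. It assumes each drafted token is accepted independently with probability `alpha`, which is a simplification. The names `alpha`, `gamma` (tokens drafted per step), and `cost_ratio` (draft step time divided by target step time) are illustrative and do not come from any vendor's numbers. Under that assumption, one target pass yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens on average and costs gamma * cost_ratio + 1 target-step equivalents.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha (a simplifying assumption, not a measurement).
    """
    if alpha >= 1.0:
        # Every draft token accepted, plus the one token from the target pass.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    cost_ratio is draft step time / target step time, so a draft model
    that is 10x faster than the target has cost_ratio = 0.1.
    """
    step_cost = gamma * cost_ratio + 1.0  # gamma draft steps + 1 target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    baseline_tps = 250  # per-user baseline quoted in the comment above
    for alpha in (0.6, 0.7, 0.8):
        for cost_ratio in (0.1, 0.05):  # draft 10x and 20x faster than target
            # Pick the draft length that maximizes the estimated speedup.
            gamma = max(range(1, 17),
                        key=lambda g: speculative_speedup(alpha, g, cost_ratio))
            s = speculative_speedup(alpha, gamma, cost_ratio)
            print(f"alpha={alpha:.1f} cost_ratio={cost_ratio:.2f} "
                  f"gamma={gamma:2d} speedup={s:.2f}x "
                  f"tps~{baseline_tps * s:.0f}")
```

With `alpha=0.8`, `gamma=5`, and `cost_ratio=0.05`, this model gives a speedup of about 2.95x, or roughly 740 TPS from a 250 TPS baseline. That is in line with the 750 TPS figure, and it shows why 1,000 TPS/user needs more than speculative decoding alone under these assumptions.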
| ▲ | rsalus a day ago | parent | next [-] | |||||||
Agreed. From my POV the constraints are still there; we've just optimized around them. We still haven't solved the core problems. | ||||||||
| ▲ | kolinko 18 hours ago | parent | prev [-] | |||||||
1,000 TPS at what model size? | ||||||||