We still have the problem that autoregressive decoders are memory-bound.
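
A rough way to see the memory-bound ceiling: at batch size 1, each decoded token has to stream every active weight from memory once, so bandwidth divided by bytes per token caps single-stream throughput. The sketch below is a back-of-envelope estimate only. The function name and the example figures (8 TB/s of bandwidth, 17B active parameters, 1 byte per parameter) are illustrative assumptions, not numbers from this thread, and KV-cache traffic is ignored.

```python
def max_decode_tps(bandwidth_bytes_per_s: float,
                   active_params: float,
                   bytes_per_param: float) -> float:
    """Upper bound on single-stream tokens/s when each token reads all active weights once.

    Ignores KV-cache reads, kernel launch overhead, and inter-GPU communication,
    so real throughput will be lower than this bound.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Illustrative assumptions only: 8 TB/s memory bandwidth, 17B active
# parameters, 8-bit weights (1 byte per parameter).
ceiling = max_decode_tps(8e12, 17e9, 1.0)
print(f"bandwidth-bound ceiling: {ceiling:.0f} tokens/s per stream")
```

Anything that emits more than one token per weight pass (batching, speculative decoding) raises effective throughput without moving this per-pass ceiling.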

The new Blackwell hardware, combined with TensorRT-LLM and speculative decoding, can consistently break the 1,000 TPS/user barrier, compared with something closer to 250 TPS/user (out of 10k+ TPS on the server).

Is there something I missed? This looks more like the story of modems going from 14.4k to 56k over a 64 kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to come from using existing constraints more efficiently, not from a sign that FiOS is coming.

I don’t have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60–80% acceptance rate, I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where to look to figure out what I am missing.
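
The arithmetic behind that estimate can be sketched with the usual speculative-decoding speedup model. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per target pass, and that `cost_ratio` is draft step time divided by target step time (0.05 to 0.1 for a 10x–20x faster draft). The 250 TPS baseline comes from the figure above. The independence assumption and the search over `gamma` are simplifications, so treat the output as a rough range rather than a prediction.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha; the target always contributes one token.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Wall-clock speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


BASELINE_TPS = 250  # per-user baseline quoted above

for alpha in (0.6, 0.7, 0.8):
    for cost_ratio in (0.10, 0.05):  # draft 10x and 20x faster than target
        # Pick the draft length that maximizes speedup under this model.
        best_gamma = max(range(1, 17), key=lambda g: speedup(alpha, g, cost_ratio))
        s = speedup(alpha, best_gamma, cost_ratio)
        print(f"alpha={alpha:.1f} cost_ratio={cost_ratio:.2f} "
              f"gamma={best_gamma} speedup={s:.2f}x tps={BASELINE_TPS * s:.0f}")
```

Under these assumptions the best case (80% acceptance, 20x faster draft) gives a speedup of roughly 3x, which is in the neighborhood of 750 TPS from a 250 TPS baseline. Reaching 1,000+ would need gains beyond what this simple model captures.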

rsalus a day ago

Agreed. From my POV the constraints are still there; we've just optimized around them. We still haven't solved the core problems.

kolinko 18 hours ago

1,000 TPS at what model size?

nyrikki 9 hours ago

Maverick 400B is what Nvidia used for their claim of 1k+ TPS on Blackwell GPUs.