kleton 4 hours ago
Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization.
kbdiaz 4 hours ago
Isn't continuous batching the standard? If they are using continuous batching, I'm curious why generated token length matters and why the lengths would cluster. If they aren't, I'm curious why not and what the tradeoff is.
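To make the tradeoff in the question concrete, here is a minimal toy sketch, not a description of any provider's actual serving stack. It counts decode steps for the same requests under static batching (a batch runs until its longest request finishes) and under continuous batching (a freed slot is refilled right away). The function names, the request lengths, and the batch size are all hypothetical, and each request is assumed to generate at least one token.

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Every slot in the batch is held until the longest request is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    pending = list(lengths)   # requests not yet started, in arrival order
    active = []               # remaining tokens for each running request
    steps = 0
    while pending or active:
        # Fill any free slots before the next decode step.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Each running request emits one token; drop the ones that finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: one long request and three short ones, two slots.
lengths = [512, 100, 100, 100]
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

In this toy model the short requests no longer wait on the long one under continuous batching, which is why output length would matter less to the scheduler there than under static batching.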