Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization.

Isn't continuous batching the standard? If they are using continuous batching -- I'm curious why generated token length matters, and why they might be clustering output lengths. If not -- I'm curious why they aren't, and what the tradeoff is here.
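
To make the question concrete, here is a toy sketch of the difference. It is not how any particular provider schedules inference; the function names, the one-token-per-step cost model, and the example lengths are all assumptions made up for illustration. It only counts decode steps for a fixed batch width.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a finished sequence frees its slot immediately."""
    queue = deque(lengths)  # requests waiting for a slot (output lengths, all > 0)
    active = []             # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        # Refill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running sequence emits one token; drop the ones that just finished.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: two long generations mixed with short ones.
lengths = [512, 40, 40, 40, 512, 40, 40, 40]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

In this toy model the static scheduler needs 1024 steps and the continuous one needs 552, because short requests stop holding slots hostage. With uniform lengths the two come out the same, which is why clustering outputs to a common length would mostly help a scheduler that is not refilling slots per step.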

	ACCount37 3 hours ago
		This "~512 batching" makes me think of things like diffusion or prefill. If they managed to put together some dirty hack that lets them generate about 512 tokens' worth of reasoning in parallel instead of in sequence, that would explain it.