Continuous batching from first principles (2025)(huggingface.co)
23 points by jxmorris12 4 hours ago | 4 comments
umairnadeem123 12 minutes ago | parent | next [-]

Good writeup, but I'm curious about tail latency under mixed prompts. If one request has a huge context and another is tiny, do you bucket by expected decode length, or just FIFO with continuous refill?

Also, did you test fairness knobs? I've seen p95 improve while a few tenants get starved unless there is some aging policy.
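The aging policy the parent is asking about can be sketched in a few lines. This is a hypothetical illustration, not anything from the article: each waiting request is scored as expected decode cost minus a bonus that grows with time in queue, so cheap requests are usually admitted first but an old, expensive request eventually wins. The names (`AgingQueue`, `AGING_WEIGHT`) and the linear aging formula are assumptions for the sketch.

```python
# Hypothetical sketch of an aging admission policy for continuous batching.
# Score = expected decode cost - AGING_WEIGHT * waiting time, recomputed on
# every pop so long-waiting requests cannot be starved forever.

AGING_WEIGHT = 0.5  # illustrative knob: how strongly age offsets cost


class AgingQueue:
    """Admit the queued request with the lowest aged score."""

    def __init__(self):
        self._items = []  # list of (request_id, expected_decode_len, arrival)

    def push(self, rid, expected_decode_len, arrival):
        self._items.append((rid, expected_decode_len, arrival))

    def pop(self, now):
        # Recompute scores at admission time so the aging bonus is current.
        best = min(self._items, key=lambda r: r[1] - AGING_WEIGHT * (now - r[2]))
        self._items.remove(best)
        return best[0]


q = AgingQueue()
q.push("big", expected_decode_len=1000, arrival=0)
q.push("tiny", expected_decode_len=10, arrival=90)
# Early on, the cheap request wins: 10 - 0.5*10 = 5 vs 1000 - 0.5*100 = 950.
first = q.pop(now=100)
# Much later, the big request's age bonus dominates: 1000 - 0.5*2100 = -50.
q.push("tiny2", expected_decode_len=10, arrival=2090)
second = q.pop(now=2100)
```

With `AGING_WEIGHT = 0` this degrades to shortest-job-first (and p95 can look great while a tenant starves); the bonus is the knob that trades tail latency against worst-case wait.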

charcircuit 4 hours ago | parent | prev | next [-]

This article does not explain what happens if multiple prompts need different experts. Does it try to schedule the maximum number of experts into memory so it can run the maximum number of prompts at once? Scheduling gets very complicated, and there are different trade-offs around the fairness of which prompts get processed when.
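The trade-off in the parent comment can be made concrete with a toy model. This is purely speculative (the article doesn't cover MoE scheduling): suppose we know which experts each queued prompt is expected to route to, and only `budget` experts fit in memory; the scheduler could then brute-force the expert set that serves the most prompts in one batch. The function name, the routing oracle, and the exhaustive search are all illustrative assumptions; a real scheduler would need a heuristic, and this ignores the fairness problem entirely.

```python
from itertools import combinations

def best_expert_set(prompt_experts, budget):
    """Speculative sketch: pick the set of <= budget resident experts that
    serves the most queued prompts at once.

    prompt_experts: {prompt_id: set of expert ids that prompt routes to}
    Returns (resident_experts, served_prompt_ids).
    """
    all_experts = sorted(set().union(*prompt_experts.values()))
    best = (frozenset(), [])
    for k in range(1, budget + 1):
        for combo in combinations(all_experts, k):
            resident = set(combo)
            # A prompt is servable only if every expert it needs is resident.
            served = [p for p, need in prompt_experts.items() if need <= resident]
            if len(served) > len(best[1]):
                best = (frozenset(combo), served)
    return best


prompts = {"a": {1, 2}, "b": {2, 3}, "c": {1, 2}, "d": {4, 5, 6}}
resident, served = best_expert_set(prompts, budget=3)
```

Here experts {1, 2, 3} serve prompts a, b, c together, while serving d would mean evicting all of them for {4, 5, 6} to run a single prompt: exactly the throughput-vs-fairness tension the comment describes.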

asteroidburger 3 hours ago | parent | prev [-]

How long until “first principles” is a meme like “considered harmful”? Or are we there already?

wavemode 3 minutes ago | parent [-]

"from first principles" has been a common phrase in science and philosophy for a long time: https://en.wikipedia.org/wiki/First_principle