Remix.run Logo
qudent 2 hours ago

Because inference is autoregressive (token n is an input for predicting token n+1), the forward pass for token n+1 cannot start until token n is complete. For a single stream, throughput is the inverse of latency (T = 1/L). Consequently, any increase in latency for the next token directly reduces the tokens/sec for the individual user.