qudent 2 hours ago
Because inference is autoregressive (token n is an input for predicting token n+1), the forward pass for token n+1 cannot start until token n is complete. For a single stream, throughput is the inverse of per-token latency (T = 1/L). Consequently, any increase in next-token latency directly reduces the tokens/sec seen by an individual user.
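A minimal sketch of this constraint, using a `time.sleep` as a stand-in for one forward pass (the function name and latency figure are illustrative, not from any real serving stack):

```python
import time

def single_stream_throughput(num_tokens: int, per_token_latency_s: float) -> float:
    """Simulate strictly sequential decoding: token n+1 cannot start
    until token n finishes, so total time is the sum of per-token latencies."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        time.sleep(per_token_latency_s)  # stand-in for one forward pass
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed  # observed tokens/sec for this one user

# With 20 ms per token, a single stream is capped near 1/0.02 = 50 tok/s,
# no matter how much aggregate throughput the server has across users.
```

Batching across many users raises aggregate tokens/sec, but the T = 1/L ceiling above holds for each individual stream.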