Because inference is autoregressive (token n is an input for predicting token n+1), the forward pass for token n+1 cannot start until token n is complete. For a single stream, throughput is the inverse of latency (T = 1/L). Consequently, any increase in latency for the next token directly reduces the tokens/sec for the individual user.
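To make the serial dependency concrete, here's a minimal Python sketch of a decode loop. Everything in it is a stand-in: `fake_forward_pass` just sleeps for an assumed 20 ms instead of running a real model. The point is only that step n+1 can't be issued until step n returns, so the measured tokens/sec comes out to roughly 1/latency.

```python
import time

PER_TOKEN_LATENCY_S = 0.02  # assumed 20 ms per forward pass, for illustration only

def fake_forward_pass(tokens):
    """Stand-in for the model: returns a 'next token' after a fixed delay."""
    time.sleep(PER_TOKEN_LATENCY_S)
    return len(tokens)  # dummy next token

def decode(prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    start = time.perf_counter()
    for _ in range(n_new_tokens):
        # Token n+1 cannot be computed until token n exists: the loop is strictly serial.
        tokens.append(fake_forward_pass(tokens))
    elapsed = time.perf_counter() - start
    print(f"{n_new_tokens / elapsed:.1f} tokens/s "
          f"(expected ~{1 / PER_TOKEN_LATENCY_S:.0f} = 1/latency)")
    return tokens

decode([1, 2, 3], n_new_tokens=50)
```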
With the exception of diffusion language models, which don't work this way but remain very niche, language models are autoregressive, which means tokens really do have to be produced in order.
And that's why model speed is such a big deal: you can't just throw more hardware at the problem, because the problem is latency, not compute.
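A toy illustration of that point, with everything assumed for the sketch (the 20 ms per-token latency, a thread pool standing in for extra devices): more parallel hardware lets you serve more concurrent streams, so aggregate tokens/sec goes up, but each individual stream still pays the same per-token latency, so per-user speed doesn't move.

```python
from concurrent.futures import ThreadPoolExecutor
import time

PER_TOKEN_LATENCY_S = 0.02  # assumed per-token latency
N_NEW_TOKENS = 50

def decode_one_stream(_stream_id):
    """Simulate one user's generation: strictly one token at a time."""
    start = time.perf_counter()
    for _ in range(N_NEW_TOKENS):
        time.sleep(PER_TOKEN_LATENCY_S)  # serial dependency on the previous token
    return N_NEW_TOKENS / (time.perf_counter() - start)

for n_devices in (1, 4):
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        start = time.perf_counter()
        per_user = list(pool.map(decode_one_stream, range(n_devices)))
        aggregate = n_devices * N_NEW_TOKENS / (time.perf_counter() - start)
    # Per-user speed stays ~1/latency; only the aggregate scales with devices.
    print(f"{n_devices} device(s): per-user ~{per_user[0]:.0f} tok/s, "
          f"aggregate ~{aggregate:.0f} tok/s")
```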