| ▲ | amelius 4 hours ago |
> Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.

No, it only increases the latency, and does not affect the throughput.
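
A toy model makes the latency/throughput distinction concrete (the stage times below are made up, not measurements of any real chip): latency is the sum of the per-chip times, while steady-state throughput is set by the slowest stage, assuming many independent requests keep every stage busy.

    # Toy model (not any real system): a request passes through N pipeline
    # stages (chips). Latency is the sum of stage times; steady-state
    # throughput is set by the slowest stage, because independent requests
    # can occupy different stages at the same time.

    stage_times = [0.004, 0.004, 0.004, 0.004]  # seconds per chip (made up)

    latency = sum(stage_times)            # end-to-end time for one item
    throughput = 1.0 / max(stage_times)   # completed items per second, pipelined

    print(f"latency: {latency*1000:.1f} ms, pipelined throughput: {throughput:.0f}/s")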
|
| ▲ | EdNutting 4 hours ago | parent | next [-] |
| It affects both. These systems are vastly more complex than the naive mental models being discussed in these comments. For one thing, going chip-to-chip is not a faultless process and does not operate at the same speed as on-chip communication. So, yes, throughput can be reduced by splitting a computation across two chips of otherwise equal speed. |
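
As a rough illustration of the interconnect point (hypothetical numbers, not measurements of real hardware), extending the toy pipeline above with a fixed chip-to-chip transfer cost per hop shows both effects: latency grows, and if a transfer becomes the slowest stage it also caps throughput.

    # Extending the toy pipeline above with a fixed chip-to-chip transfer
    # cost per hop (hypothetical numbers, not measurements of real hardware).
    stage_times = [0.004, 0.004, 0.004, 0.004]   # compute per chip, seconds
    hop_cost = 0.006                             # interconnect transfer, seconds

    stages = []
    for i, t in enumerate(stage_times):
        stages.append(t)
        if i < len(stage_times) - 1:
            stages.append(hop_cost)              # transfer between chips

    latency = sum(stages)                        # every item pays every hop
    throughput = 1.0 / max(stages)               # the bottleneck stage limits it

    print(f"latency: {latency*1000:.1f} ms, throughput: {throughput:.0f}/s")
    # With hop_cost > per-chip compute, the interconnect is the bottleneck,
    # so splitting across chips hurts throughput as well as latency.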
|
| ▲ | qudent 4 hours ago | parent | prev [-] |
It does affect the throughput for an individual user, because you need all output tokens up to n in order to generate output token n+1.
| |
▲ | EdNutting 4 hours ago | parent [-]

:facepalm: - That’s not how that works.
▲ | qudent 2 hours ago | parent | next [-]

Because inference is autoregressive (token n is an input for predicting token n+1), the forward pass for token n+1 cannot start until token n is complete. For a single stream, throughput is the inverse of latency (T = 1/L). Consequently, any increase in latency for the next token directly reduces the tokens/sec for the individual user.
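
A minimal sketch of the single-stream argument, assuming a stand-in 20 ms forward pass rather than any real model: because decoding is strictly sequential, the achievable tokens/sec for one user is just 1 divided by the per-token latency.

    import time

    # Minimal sketch, not a real model: each forward pass is a 20 ms sleep.
    # Generation is sequential, so token n+1 cannot start before token n ends.
    def decode_one_token(prev_tokens):
        time.sleep(0.02)          # stand-in for one forward pass (20 ms)
        return len(prev_tokens)   # dummy "next token"

    tokens = [0]                  # prompt
    start = time.time()
    for _ in range(50):
        tokens.append(decode_one_token(tokens))   # strictly sequential loop
    elapsed = time.time() - start

    print(f"single-stream tokens/sec: {(len(tokens) - 1) / elapsed:.1f}")
    # ~1 / 0.02 = ~50 tok/s. Batching more users raises aggregate throughput,
    # but each individual stream is still gated by its per-token latency.
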
▲ | catoc 3 hours ago | parent | prev | next [-]

Your comment may be helpful - but would be much more helpful if you shared how it does work.

Edit: I see you’re doing this further down; #thumbs up
▲ | littlestymaar 3 hours ago | parent | prev [-]

How do you think that works?! With the exception of diffusion language models, which don't work this way but are very niche, language models are autoregressive, which means you do indeed need to process tokens in order. And that's why model speed is such a big deal: you can't just throw more hardware at the problem, because the problem is latency, not compute.
|
|