zozbot234 7 days ago
> Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.

I think this might just be the key insight. The key advantage of doing batched inference at huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free, since at any given moment they're being shared among a huge number of requests; you "only" pay for the request-specific raw compute and the memory storage and bandwidth for the activations. And the proprietary models are now huge, highly quantized, extreme-MoE models where the former factor (model size) is huge and the latter (request-specific compute) has been correspondingly minimized - and where it hasn't, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than it does locally.

(There are "tricks" you could do locally to try to compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token generation on CPU. But of course you pay for that with increased latency, which you may or may not be okay with in that context.)
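To make the amortization concrete, here is a rough back-of-envelope sketch in Python. Every number in it (active parameter count, quantization width, per-request activation traffic) is a made-up assumption for illustration, not a figure for any real model:

    # Per-request memory traffic per generated token, as a function of batch size.
    # The weight read is shared by the whole batch; activation/KV traffic is per request.
    def bytes_per_token(batch_size,
                        active_params=40e9,      # assumed active (routed) params per token
                        bytes_per_param=0.5,     # assumed ~4-bit quantization
                        activation_bytes=50e6):  # assumed per-request activation/KV bytes per token
        weight_traffic = active_params * bytes_per_param / batch_size
        return weight_traffic + activation_bytes

    for b in (1, 8, 64, 512):
        print(f"batch={b:4d}  ~{bytes_per_token(b) / 1e9:.2f} GB per token per request")

Under these assumed numbers, the weight-streaming cost falls from roughly 20 GB per token for a single request to well under 0.1 GB per token at batch 512, at which point the per-request activation traffic is most of what is left to pay for.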
patrick451 7 days ago
> The key advantage of doing batched inference at huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge number of requests!)

Kind of unrelated, but this comment made me wonder when we will start seeing side-channel attacks that force queries to leak into each other.
saagarjha 7 days ago
mmap is not free. It just moves bandwidth around. | ||||||||
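For illustration, here is a minimal sketch of the mmap trick mentioned above, assuming a hypothetical weights.bin file (the file name and tensor shape are placeholders). The page-ins on first touch are exactly the bandwidth being moved around: reads come from disk instead of from weights preloaded into RAM.

    import numpy as np

    shape = (1024, 1024)

    # Write some fake "weights" to disk once, just so the sketch is self-contained.
    np.random.default_rng(0).standard_normal(shape).astype(np.float16).tofile("weights.bin")

    # Map the file read-only: nothing is loaded up front, pages come in on first touch.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r", shape=shape)

    x = np.ones(shape[1], dtype=np.float16)
    y = weights @ x  # touching each page here incurs the disk read (the latency/bandwidth cost)
    print(y.shape)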