fergal_reid 7 days ago
I think the most direct answer is that at scale, inference can be batched: processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup). If you want a survey of intermediate-level engineering tricks, this post we wrote on the Fin AI blog might be interesting (there's probably a further level of proprietary techniques OpenAI etc. have beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/
nodja 7 days ago
This is the real answer; I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k of hardware to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per request is a pretty big deal (assuming you have the users to saturate the hardware).

My simple explanation of how batching works:

Since the bottleneck in processing LLMs is loading the model's weights onto the GPU to do the computing, instead of computing each request separately you can compute multiple requests at the same time, ergo batching.

Let's make a visual example: say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time. (Legend: LA = load weight set A, CA1 = compute weight set A for request 1.)

    LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2

But you could instead batch the compute parts together:

    LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2

Now if you consider that loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the two approaches if loading were just 10 times slower (consider 1 letter a unit of time).

Time spent using approach 1 (1 request at a time):

    LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC

Time spent using approach 2 (batching):

    LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC

The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious difference in speed. I believe the real-world restriction is actually that serving more users requires more memory to store each request's activation state (the KV cache), so you'll eventually run out of memory and have to balance how many people per GPU cluster you serve at the same time.

TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.
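If it helps to see the toy model as code, here's a minimal sketch of the arithmetic behind the two schedules above. The names and numbers (LOAD_COST, COMPUTE_COST, WEIGHT_SETS, the 10:1 load-to-compute ratio) are just the assumptions from the example, not measurements of real hardware or real serving code.

    # Toy cost model for the two schedules above, not real serving code.
    # Assumption: loading one weight set costs 10 time units and computing
    # one request against an already-loaded set costs 1 (the 10:1 ratio
    # used in the "chart"); real load/compute gaps are far larger.
    LOAD_COST = 10     # time to load one weight set into the GPU's cache
    COMPUTE_COST = 1   # time to compute one request against a loaded set
    WEIGHT_SETS = 3    # the A, B, C sets from the example

    def sequential_time(num_requests):
        # Approach 1: reload every weight set for each request in turn.
        return num_requests * WEIGHT_SETS * (LOAD_COST + COMPUTE_COST)

    def batched_time(num_requests):
        # Approach 2: load each weight set once, run the whole batch on it.
        return WEIGHT_SETS * (LOAD_COST + num_requests * COMPUTE_COST)

    for n in (1, 2, 10, 100):
        print(n, sequential_time(n), batched_time(n))
    # 1 -> 33 vs 33, 2 -> 66 vs 36 (the 66- and 36-letter charts above),
    # 10 -> 330 vs 60, 100 -> 3300 vs 330.

In this toy model the batched cost grows only with the cheap compute term, which is why per-request cost keeps dropping until per-request memory (the activation/KV-cache state mentioned above) becomes the limit.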