ritz_labringue | 7 days ago
The short answer is "batch size". These days, most frontier LLMs are "Mixture of Experts" models, meaning they only activate a small subset of their weights for each token. This makes them a lot more efficient to run at high batch size. If you tried to run GPT-4 at home, you'd still need enough VRAM to hold the entire model, which means several H100s (each costing something like $40k), and for personal use you'd be under-utilizing those cards by a huge margin.

It's a bit like asking "How come Apple can make iPhones for billions of people but I can't even build a single one in my garage?"
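To make the "small subset of weights" point concrete, here's a toy sketch (purely illustrative, not any real model's code) of a top-k-routed MoE layer, assuming 8 experts with top-2 routing: every expert matrix has to sit in memory, but each token only multiplies against two of them.

    import numpy as np

    n_experts, top_k, d = 8, 2, 16
    experts = [np.random.randn(d, d) for _ in range(n_experts)]  # all must fit in memory
    router = np.random.randn(d, n_experts)

    def moe_layer(x):
        scores = x @ router                       # route one token's activation
        active = np.argsort(scores)[-top_k:]      # only the top_k experts run
        w = np.exp(scores[active])
        w /= w.sum()                              # softmax over the chosen experts
        return sum(wi * (experts[e] @ x) for wi, e in zip(w, active))

    y = moe_layer(np.random.randn(d))
    # All 8 expert matrices occupy memory; this token touched only 2 of them.

A provider serving many users can keep all those resident weights busy; a single user at home mostly pays for VRAM that sits idle.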
jsnell | 7 days ago
> These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

I don't really understand why you're trying to connect MoE and batching here. Your stated mechanism is not only incorrect but actually the wrong way around.

The efficiency of batching comes from balancing compute against memory bandwidth: you load a tile of parameters from VRAM into cache, apply those weights to all of the batched requests, and only then load the next tile. So batching only helps when multiple queries need the same weights for the same token. For dense models, that's always the case. For MoE it isn't, precisely because not all weights are activated on every token. Suddenly your batching becomes a complex scheduling problem, since the experts at a given layer won't all see the same load. Surely a solvable problem, but MoE is not the enabler of batching; it makes batching significantly harder.
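A back-of-the-envelope sketch of that mechanism (toy numbers, fp16 weights assumed): every weight is streamed from VRAM once per forward pass but reused once per request in the batch, so the FLOPs you get per byte loaded grows linearly with batch size.

    def arithmetic_intensity(d_in, d_out, batch, bytes_per_param=2):
        flops = 2 * d_in * d_out * batch              # one multiply-add per weight per request
        bytes_moved = d_in * d_out * bytes_per_param  # weight tile streamed from VRAM once
        return flops / bytes_moved

    for b in (1, 8, 64):
        print(b, arithmetic_intensity(4096, 4096, b))
    # batch 1 is hopelessly memory-bound; batch 64 does 64x the work per byte loaded.
    # With MoE, the batch is split across experts, so each expert sees a smaller,
    # uneven effective batch -- which is the scheduling problem described above.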
radarsat1 | 7 days ago
I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).
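For what it's worth, here's a quick toy simulation of that (assuming 64 experts, top-2 routing, and uniformly random routing per request, which real per-token routers are not): a big batch does touch nearly every expert, but each expert also ends up with many requests, so the loaded weights are still shared.

    import random
    from collections import Counter

    n_experts, top_k = 64, 2
    for batch in (1, 4, 100, 1000):
        loads = Counter()
        for _ in range(batch):
            for e in random.sample(range(n_experts), top_k):
                loads[e] += 1
        print(batch, "requests ->", len(loads), "experts touched,",
              "busiest expert load:", max(loads.values()))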
arjvik | 7 days ago
Essentially, the cost of inference is amortized across the provider's many concurrent users.
robotnikman | 7 days ago
I wonder then if it's possible to load the rarely used parts into main memory and keep the more heavily used parts in VRAM.
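That's essentially expert offloading. A minimal PyTorch-flavored sketch of the idea (the "hot" set here stands in for hypothetical usage statistics, not a real API): keep frequently hit experts resident in VRAM and pull cold ones over PCIe only when a token actually routes to them, at a significant latency cost.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    experts = [torch.nn.Linear(4096, 4096) for _ in range(8)]
    hot = {0, 3}  # experts observed to be hit most often (hypothetical stats)

    for i, e in enumerate(experts):
        e.to(device if i in hot else "cpu")   # hot experts in VRAM, the rest in system RAM

    def run_expert(i, x):
        e = experts[i]
        if next(e.parameters()).device.type != device:
            e.to(device)                      # drag a cold expert over PCIe on demand (slow)
        return e(x.to(device))

In practice, inference stacks that support partial offload do something similar, typically at layer granularity rather than per expert.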
cududa | 7 days ago
Great metaphor