ritz_labringue | 7 days ago
The short answer is "batch size". These days, most frontier LLMs are "Mixture of Experts" models, meaning they only activate a small subset of their weights for each token. This makes them a lot more efficient to run at high batch size. If you tried to run GPT-4 at home, you'd still need enough VRAM to hold the entire model, which means several H100s (each costing something like $40k), and for personal use you'd be under-utilizing those cards by a huge margin.

It's a bit like asking "How come Apple can make iPhones for billions of people but I can't even build a single one in my garage?"
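To make the "small subset of weights" point concrete, here's a toy sketch (purely illustrative, not any real model's code) of a top-k-routed MoE layer, assuming 8 experts with top-2 routing: every expert matrix has to sit in memory, but each token only multiplies against two of them.

    import numpy as np

    n_experts, top_k, d = 8, 2, 16
    experts = [np.random.randn(d, d) for _ in range(n_experts)]  # all must fit in memory
    router = np.random.randn(d, n_experts)

    def moe_layer(x):
        scores = x @ router                       # route one token's activation
        active = np.argsort(scores)[-top_k:]      # only the top_k experts run
        w = np.exp(scores[active])
        w /= w.sum()                              # softmax over the chosen experts
        return sum(wi * (experts[e] @ x) for wi, e in zip(w, active))

    y = moe_layer(np.random.randn(d))
    # All 8 expert matrices occupy memory; this token touched only 2 of them.

A provider serving many users can keep all those resident weights busy; a single user at home mostly pays for VRAM that sits idle.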
jsnell | 7 days ago
> These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

I don't really understand why you're trying to connect MoE and batching here. Your stated mechanism is not only incorrect but actually the wrong way around.

The efficiency of batching comes from balancing compute against memory bandwidth: you load a tile of parameters from VRAM into cache, apply those weights to all of the batched requests, and only then load the next tile. So batching only helps when multiple queries need the same weights for the same token. For dense models, that's always the case. For MoE it isn't, precisely because not all weights are activated on every token. Suddenly your batching becomes a complex scheduling problem, since the experts at a given layer won't all see the same load. Surely a solvable problem, but MoE is not the enabler of batching; it makes batching significantly harder.
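A back-of-the-envelope sketch of that mechanism (toy numbers, fp16 weights assumed): every weight is streamed from VRAM once per forward pass but reused once per request in the batch, so the FLOPs you get per byte loaded grows linearly with batch size.

    def arithmetic_intensity(d_in, d_out, batch, bytes_per_param=2):
        flops = 2 * d_in * d_out * batch              # one multiply-add per weight per request
        bytes_moved = d_in * d_out * bytes_per_param  # weight tile streamed from VRAM once
        return flops / bytes_moved

    for b in (1, 8, 64):
        print(b, arithmetic_intensity(4096, 4096, b))
    # batch 1 is hopelessly memory-bound; batch 64 does 64x the work per byte loaded.
    # With MoE, the batch is split across experts, so each expert sees a smaller,
    # uneven effective batch -- which is the scheduling problem described above.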
radarsat1 | 7 days ago
I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).
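For what it's worth, here's a quick toy simulation of that (assuming 64 experts, top-2 routing, and uniformly random routing per request, which real per-token routers are not): a big batch does touch nearly every expert, but each expert also ends up with many requests, so the loaded weights are still shared.

    import random
    from collections import Counter

    n_experts, top_k = 64, 2
    for batch in (1, 4, 100, 1000):
        loads = Counter()
        for _ in range(batch):
            for e in random.sample(range(n_experts), top_k):
                loads[e] += 1
        print(batch, "requests ->", len(loads), "experts touched,",
              "busiest expert load:", max(loads.values()))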
arjvik | 7 days ago
Essentially, the cost of inference is amortized across the provider's many concurrent users.
robotnikman | 7 days ago
I wonder then if it's possible to load the rarely used parts into main memory and keep the more heavily used parts in VRAM.
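That's essentially expert offloading. A minimal PyTorch-flavored sketch of the idea (the "hot" set here stands in for hypothetical usage statistics, not a real API): keep frequently hit experts resident in VRAM and pull cold ones over PCIe only when a token actually routes to them, at a significant latency cost.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    experts = [torch.nn.Linear(4096, 4096) for _ in range(8)]
    hot = {0, 3}  # experts observed to be hit most often (hypothetical stats)

    for i, e in enumerate(experts):
        e.to(device if i in hot else "cpu")   # hot experts in VRAM, the rest in system RAM

    def run_expert(i, x):
        e = experts[i]
        if next(e.parameters()).device.type != device:
            e.to(device)                      # drag a cold expert over PCIe on demand (slow)
        return e(x.to(device))

In practice, inference stacks that support partial offload do something similar, typically at layer granularity rather than per expert.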
cududa | 7 days ago
Great metaphor