fergal_reid 7 days ago
I think the most direct answer is that at scale, inference can be batched: processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup). If you want a survey of intermediate-level engineering tricks, this post we wrote on the Fin AI blog might be interesting (there's probably a further level of proprietary techniques OpenAI etc. have beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/
nodja 7 days ago
This is the real answer; I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k of hardware to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per request is a pretty big deal (assuming you have the users to saturate the hardware).

My simple explanation of how batching works:

Since the bottleneck in processing LLMs is loading the model's weights onto the GPU to do the computing, instead of computing each request separately you can compute multiple requests at the same time, ergo batching.

Let's make a visual example: say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time. (Legend: LA = load weight set A, CA1 = compute weight set A for request 1.)

    LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2

But you could instead batch the compute parts together:

    LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2

Now if you consider that loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the two approaches if loading were just 10 times slower (consider 1 letter a unit of time).

Time spent using approach 1 (1 request at a time):

    LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC

Time spent using approach 2 (batching):

    LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC

The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious difference in speed. I believe the real-world restriction is actually that serving more users requires more memory to store each request's activation state (the KV cache), so you'll eventually run out of memory and have to balance how many people per GPU cluster you serve at the same time.

TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.
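If it helps to see the toy model as code, here's a minimal sketch of the arithmetic behind the two schedules above. The names and numbers (LOAD_COST, COMPUTE_COST, WEIGHT_SETS, the 10:1 load-to-compute ratio) are just the assumptions from the example, not measurements of real hardware or real serving code.

    # Toy cost model for the two schedules above, not real serving code.
    # Assumption: loading one weight set costs 10 time units and computing
    # one request against an already-loaded set costs 1 (the 10:1 ratio
    # used in the "chart"); real load/compute gaps are far larger.
    LOAD_COST = 10     # time to load one weight set into the GPU's cache
    COMPUTE_COST = 1   # time to compute one request against a loaded set
    WEIGHT_SETS = 3    # the A, B, C sets from the example

    def sequential_time(num_requests):
        # Approach 1: reload every weight set for each request in turn.
        return num_requests * WEIGHT_SETS * (LOAD_COST + COMPUTE_COST)

    def batched_time(num_requests):
        # Approach 2: load each weight set once, run the whole batch on it.
        return WEIGHT_SETS * (LOAD_COST + num_requests * COMPUTE_COST)

    for n in (1, 2, 10, 100):
        print(n, sequential_time(n), batched_time(n))
    # 1 -> 33 vs 33, 2 -> 66 vs 36 (the 66- and 36-letter charts above),
    # 10 -> 330 vs 60, 100 -> 3300 vs 330.

In this toy model the batched cost grows only with the cheap compute term, which is why per-request cost keeps dropping until per-request memory (the activation/KV-cache state mentioned above) becomes the limit.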