superasn 7 days ago
Thanks for the helpful reply! As I wasn't able to fully understand it, I pasted your reply into ChatGPT and asked it some follow-up questions. Here is what I understand from that interaction:

- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, the weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow; the ops themselves are fast.
- Without batching: load layer > compute user1 > load layer again > compute user2.
- With batching: load the layer once > compute for all users > hand off to the next GPU, etc.
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a maximum batch size.

This does make sense to me, but does it sound accurate to you? Would love to know if I'm still missing something important.
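To make the "load the layer once, compute for all users" step concrete, here is a minimal sketch of unbatched vs. batched passes through a single layer. It is an illustration with made-up sizes and a PyTorch linear layer standing in for a real transformer layer, not anything from an actual serving stack:

    import torch

    # One "layer" of the model: a big weight matrix sitting in memory.
    # Sizes are made up; a real LLM layer is far larger and runs on a GPU.
    d_model = 4096
    layer = torch.nn.Linear(d_model, d_model)

    # One token's worth of activations per user.
    users = [torch.randn(1, d_model) for _ in range(8)]

    # Without batching: each call is a separate matmul, so the weight
    # matrix is streamed out of memory once per user.
    outputs_unbatched = [layer(x) for x in users]

    # With batching: stack the users' activations and stream the weights
    # once for a single large matmul that serves everyone.
    batch = torch.cat(users, dim=0)   # shape (8, d_model)
    outputs_batched = layer(batch)    # one pass over the weights

    # Same math either way; batching just amortizes the memory traffic.
    assert torch.allclose(outputs_batched[0], outputs_unbatched[0][0], atol=1e-5)

The compute per user is identical; what changes is how many times the weights travel from VRAM to the on-chip cache, which is exactly the slow part in the list above.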
frabcus 7 days ago
This seems a bit complicated to me. They don't serve very many models. My assumption is they simply dedicate GPUs to specific models, so the model is always in VRAM and there's no loading per request - loading a model in takes a while anyway. The limiting factor compared to local is dedicated VRAM: if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting it most of the time, whenever you're not querying.
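A rough back-of-envelope version of that utilization point (the hourly cost and usage figures below are made-up assumptions, not numbers from this thread):

    # Illustrative only: compare cost per *useful* hour for a dedicated
    # local GPU vs. a shared GPU kept busy by many users' requests.
    gpu_hourly_cost = 2.50      # assumed amortized cost of an 80GB-class GPU, $/hour
    hours_per_day = 24

    local_active_hours = 2      # assumed hours/day you actually query your local box
    local_cost_per_active_hour = gpu_hourly_cost * hours_per_day / local_active_hours

    shared_utilization = 0.8    # assumed fraction of time a provider's GPU is serving
    shared_cost_per_active_hour = gpu_hourly_cost / shared_utilization

    print(f"dedicated local GPU: ${local_cost_per_active_hour:.2f} per active hour")
    print(f"shared provider GPU: ${shared_cost_per_active_hour:.2f} per active hour")

With those made-up numbers the mostly idle local GPU costs roughly ten times as much per hour of actual use, which is the waste being described.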
nodja 7 days ago
Yeah, ChatGPT pretty much nailed it.