superasn 7 days ago
Thanks for the helpful reply! As I wasn't able to fully understand it, I pasted your reply into ChatGPT and asked it some follow-up questions. Here is what I understand from that interaction:

- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, the weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow; the ops themselves are fast.
- Without batching: load layer > compute user1 > load layer again > compute user2.
- With batching: load the layer once > compute for all users > hand off to the next GPU, etc.
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a maximum batch size.

This does make sense to me, but does it sound accurate to you? Would love to know if I'm still missing something important.
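To make the "load the layer once, compute for all users" step concrete, here is a minimal sketch of unbatched vs. batched passes through a single layer. It is an illustration with made-up sizes and a PyTorch linear layer standing in for a real transformer layer, not anything from an actual serving stack:

    import torch

    # One "layer" of the model: a big weight matrix sitting in memory.
    # Sizes are made up; a real LLM layer is far larger and runs on a GPU.
    d_model = 4096
    layer = torch.nn.Linear(d_model, d_model)

    # One token's worth of activations per user.
    users = [torch.randn(1, d_model) for _ in range(8)]

    # Without batching: each call is a separate matmul, so the weight
    # matrix is streamed out of memory once per user.
    outputs_unbatched = [layer(x) for x in users]

    # With batching: stack the users' activations and stream the weights
    # once for a single large matmul that serves everyone.
    batch = torch.cat(users, dim=0)   # shape (8, d_model)
    outputs_batched = layer(batch)    # one pass over the weights

    # Same math either way; batching just amortizes the memory traffic.
    assert torch.allclose(outputs_batched[0], outputs_unbatched[0][0], atol=1e-5)

The compute per user is identical; what changes is how many times the weights travel from VRAM to the on-chip cache, which is exactly the slow part in the list above.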
frabcus 7 days ago
This seems a bit complicated to me. They don't serve very many models. My assumption is they simply dedicate GPUs to specific models, so the model is always in VRAM and there's no loading per request - loading a model in takes a while anyway. The limiting factor compared to local is dedicated VRAM: if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting it most of the time, whenever you're not querying.
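A rough back-of-envelope version of that utilization point (the hourly cost and usage figures below are made-up assumptions, not numbers from this thread):

    # Illustrative only: compare cost per *useful* hour for a dedicated
    # local GPU vs. a shared GPU kept busy by many users' requests.
    gpu_hourly_cost = 2.50      # assumed amortized cost of an 80GB-class GPU, $/hour
    hours_per_day = 24

    local_active_hours = 2      # assumed hours/day you actually query your local box
    local_cost_per_active_hour = gpu_hourly_cost * hours_per_day / local_active_hours

    shared_utilization = 0.8    # assumed fraction of time a provider's GPU is serving
    shared_cost_per_active_hour = gpu_hourly_cost / shared_utilization

    print(f"dedicated local GPU: ${local_cost_per_active_hour:.2f} per active hour")
    print(f"shared provider GPU: ${shared_cost_per_active_hour:.2f} per active hour")

With those made-up numbers the mostly idle local GPU costs roughly ten times as much per hour of actual use, which is the waste being described.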
nodja 7 days ago
Yeah, ChatGPT pretty much nailed it.