pphysch 5 days ago

It's not totally surprising given the economics of LLM operation. An idle LLM is much more resource-heavy than an idle web service: to achieve acceptable chat response latency, the model weights need to already be loaded into accelerator memory, and I doubt these huge SotA models can go from cold start to inference in milliseconds or even seconds. So OpenAI is incentivized to push as many users onto as few models as possible, to manage capacity and increase efficiency.
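A rough back-of-envelope sketch of why cold starts are seconds at best, not milliseconds (every figure here is an assumption for illustration, not OpenAI's actual hardware or numbers):

    # Illustrative cold-start estimate: time just to move sharded weights
    # onto one serving replica's accelerators. Figures are made up.
    def cold_start_seconds(weights_gb, hosts, link_gb_per_s):
        per_host_gb = weights_gb / hosts        # weights sharded across hosts
        return per_host_gb / link_gb_per_s      # ignores kernel warm-up, KV-cache allocation, etc.

    # e.g. ~1 TB of weights sharded over 8 hosts, loading at ~20 GB/s per host
    print(cold_start_seconds(weights_gb=1000, hosts=8, link_gb_per_s=20))  # ~6 s, best case

And that is the optimistic case where the weights are already staged on fast local storage; pulling them over the network or from object storage would be much slower.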

saurik 5 days ago | parent | next [-]

Unless overall demand swings massively and suddenly between models throughout the day, this effect shouldn't matter; I would expect the number of wasted computers to be merely on the order of the number of models (so, like, maybe 19 wasted computers), even if you have hundreds of thousands of computers operating.
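A tiny sketch of that argument (the per-server capacity and demand figures are invented): if each model's pool is sized to its steady demand, the stranded capacity is just the rounding-up slack, bounded by one server per model regardless of fleet size:

    import math

    # Idle headroom when each model gets ceil(demand / capacity) dedicated servers.
    def idle_capacity(demand_per_model, server_capacity):
        waste = 0.0
        for demand in demand_per_model:
            servers = math.ceil(demand / server_capacity)
            waste += servers * server_capacity - demand
        return waste  # bounded by len(demand_per_model) * server_capacity

    # e.g. 19 models, arbitrary demand in req/s, 1000 req/s per server
    demands = [123_400, 51_700, 8_900] + [2_500] * 16
    print(idle_capacity(demands, server_capacity=1_000))  # 9000.0 req/s of slack,
                                                          # i.e. ~9 servers; the bound is 19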

danpalmer 5 days ago | parent | prev [-]

This was my thought. They messaged quite heavily in advance that they were capacity-constrained, and I'd guess they just want to shuffle out GPT-4 serving as quickly as possible: its utilisation will only get worse over time, and that's capacity they could be utilising better for GPT-5 serving.