beernet 8 hours ago

More than the downtime, I'm surprised by the actual uptime. Hard to imagine how difficult this must be, given the speed of growth.

nippoo 8 hours ago | parent | next [-]

Truly! As someone who's worked with HPC and GPUs in a scientific research context, trying to get a service like this to work reliably is a different ballgame to your usual webapp stack...

lostlogin 8 hours ago | parent | next [-]

But… imagine that same scientific research but you have an unlimited budget. I’d imagine that helps.

Some of the comments here mention their monthly spend, and it’s eye watering.

handoflixue 5 hours ago | parent [-]

It would be an "unlimited budget" if they were a monopoly, but they're in a bidding war with three other "unlimited budget" AI companies over a resource no one expected to be scarce. There's simply not enough supply to meet demand, no matter how much money you have.

rvnx 7 hours ago | parent | prev | next [-]

I think you have to see this as a bunch of stateless requests, which makes the problem way easier.

  LLM requests that do not call tools need nothing external by definition.
  No central server, nothing; they can even survive without the context cache.
  All you need is to load the read-only, immutable model weights (and only once!) from an S3-like source on startup.

  If it takes 4 servers to process a request, you can group them 4 by 4 and route each request to one group (sharding).

  Copy-paste the exact same setup XXX times and there you have your highly parallelizable service (until you run out of money).
It's very doable: any serious SRE can find a way to set up "larger than one card" models like Kimi or DeepSeek (unquantized) if they have a tightly-coupled HPC cluster (or a pair of very, very beefy servers).

If you run out of servers, then it's again a money problem, not an architectural problem (and modern datacenters are already scalable).
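A minimal sketch of that sharded, stateless routing idea, under assumptions: ReplicaGroup, infer, and all the numbers here are made up for illustration, not any vendor's actual stack.

  # Hypothetical sketch: each group of 4 servers holds one full copy of the
  # read-only weights; any group can serve any request, so a round-robin
  # router with no central state is enough.
  import itertools

  class ReplicaGroup:
      def __init__(self, hosts):
          self.hosts = hosts  # e.g. 4 servers jointly holding the weights

      def infer(self, prompt):
          # In reality: a tensor-parallel forward pass across self.hosts.
          return f"completion from {self.hosts}"

  # Copy-paste the exact same setup N times.
  groups = [ReplicaGroup([f"gpu-{g}-{i}" for i in range(4)]) for g in range(100)]
  rr = itertools.cycle(groups)  # round-robin, no shared state

  def handle_request(prompt):
      # Stateless: if a group is down, skip it and retry on the next one.
      return next(rr).infer(prompt)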

Take the best SRE, give them no budget, and there is no solution.

So inference is the easy part.

With Codex or Claude Code, if a request takes a lot of time or has slow cold-start latency, that's considered very acceptable.

Some users would probably not even see the difference if a request takes 2 minutes versus 3 minutes.

The really difficult part is context caching and external tools, because now you are depending on services that might be lagging.

  Executing code, browsing the web: all of that is tricky to scale because these services are very unreliable (they tend to time out, require a large cache of web pages, need captchas circumvented, etc.).
These are traditional scaling problems, but they are harder here because all these pieces are fragile and queues can snowball easily.
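One standard defense against that snowballing, sketched here under assumptions (the queue size, timeout value, and function names are all made up): bound the queue depth and time out slow tool calls, so a lagging dependency sheds load instead of backing up the whole fleet.

  # Illustrative only: bound the queue and time out slow external tool
  # calls so one lagging dependency fails fast instead of snowballing.
  import asyncio

  TOOL_TIMEOUT_S = 10  # made-up budget for a web fetch or code run
  queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # bounded on purpose

  def submit(tool_call):
      try:
          queue.put_nowait(tool_call)  # reject immediately when full
      except asyncio.QueueFull:
          raise RuntimeError("overloaded, retry later")

  async def worker():
      while True:
          tool_call = await queue.get()
          try:
              await asyncio.wait_for(tool_call(), timeout=TOOL_TIMEOUT_S)
          except asyncio.TimeoutError:
              pass  # a flaky fetch should not pin a worker forever
          finally:
              queue.task_done()
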
CSSer 8 hours ago | parent | prev | next [-]

Can you speak a little more to this? I'm curious what kind of parameters one must consider/monitor and what kind of novel things could go wrong.

aleksiy123 7 hours ago | parent [-]

My guesses are:

Hardware capacity constraints are going to be the big one.

Effective caching is another; I bet if you start hitting cold caches, the whole thing's going to degrade rapidly (see the sketch after this list).

The ground is probably shifting pretty rapidly.

Power users are trying to get the most out of their subscriptions and so are hammering you as fast as they possibly can. See Ralph loops.

Harnesses are evolving pretty rapidly, as are new alternative harnesses. That makes load patterns less predictable and harder to cache.

Demand is increasing both from more customers and from each user, as they figure out more effective workflows.

Users are pretty sensitive to model quality changes. You probably want smart routing, but users want the best model all the time.

Models keep getting bigger and bigger.

On top of that, they are probably hiring and onboarding more people, so system complexity and codebase complexity are growing.
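For the caching guess above, a toy illustration (every name and number is invented): an LRU cache over prompt-prefix KV state. When load patterns shift, the hit rate drops and every miss becomes a full, expensive prefill.

  # Toy LRU prefix cache; a shifting workload lowers the hit rate and
  # turns misses into full prefill recomputation.
  from collections import OrderedDict

  class PrefixCache:
      def __init__(self, capacity=10_000):
          self.entries = OrderedDict()
          self.capacity = capacity
          self.hits = self.misses = 0

      def lookup(self, prefix_hash):
          if prefix_hash in self.entries:
              self.entries.move_to_end(prefix_hash)  # recently used
              self.hits += 1
              return self.entries[prefix_hash]
          self.misses += 1  # cold: recompute the whole KV prefill
          return None

      def store(self, prefix_hash, kv_state):
          self.entries[prefix_hash] = kv_state
          if len(self.entries) > self.capacity:
              self.entries.popitem(last=False)  # evict least recent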

Yhippa 5 hours ago | parent | prev [-]

Just ask Claude and some agents to fix it...

wrs 8 hours ago | parent | prev | next [-]

On the other hand, the status page is blaming the authentication system, which one would think is not a frontier-class problem.

Havoc 3 hours ago | parent | prev [-]

Would have thought that, compared to training, the serving part is pretty easy. Less of an "everything needs to come together at once" problem and more "move demand to a working cluster if one bombs, and keep some spare capacity".