rvnx 7 hours ago

I think you have to see this as a bunch of stateless requests, and this makes the problem way easier.

  LLM requests that do not call tools do not need anything external by definition.
  No central server, nothing, they can even survive without the context cache.
  All you need is to load (and only once!) the read-only, immutable model weights from an S3-like source on startup.

  If it takes 4 servers to process a request, then you can group them into sets of 4 and route each request to one group (sharding).

  Copy-paste the exact same setup XXX times and there you have your highly parallelizable service (until you run out of money).
It's very doable; any serious SRE can find a way to set up "larger than one card" models like Kimi or DeepSeek (unquantized) if they have a tightly-coupled HPC cluster (or a pair of very, very beefy servers).
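
As a rough sketch of that routing idea (the group hostnames and the /v1/completions path below are made-up placeholders, not any specific product's API):

  import itertools
  import requests  # any HTTP client works; requests is just an example

  # Each "group" is 4 tightly-coupled servers behind one endpoint,
  # all of which loaded the same read-only weights once at startup.
  # These hostnames are hypothetical placeholders.
  GROUPS = [
      "http://group-0.inference.internal",
      "http://group-1.inference.internal",
      "http://group-2.inference.internal",
  ]

  # Requests are stateless, so any group can serve any request:
  # plain round-robin is enough.
  _next_group = itertools.cycle(GROUPS)

  def infer(prompt: str) -> str:
      group = next(_next_group)
      resp = requests.post(f"{group}/v1/completions",
                           json={"prompt": prompt}, timeout=300)
      resp.raise_for_status()
      return resp.json()["text"]

Adding capacity is then just appending more identical groups to the list, which is the money problem, not the architecture problem.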

If you run out of servers, that is again a money problem, not an architectural one (and modern datacenters are already scalable).

Take the best SRE but give them no budget, and there is no solution.

So inference is the easy part.

With Codex or Claude Code, if a request takes a lot of time or has slow cold-start latency, it's considered very acceptable.

Some users would probably not even see the difference if a request takes 2 minutes versus 3 minutes.

The really difficult part is context caching and external tools, because now you are depending on services that might be lagging.

  Executing code, browsing the web: all of that is tricky to scale because it is very unreliable (tends to time out, requires a large cache of web pages, circumventing captchas, etc.).
These are traditional scaling problems, but they are more difficult because all these pieces are fragile and queues can snowball easily.
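
One common way to keep a lagging tool from snowballing the queue is a hard timeout plus a small retry budget per call; a minimal sketch, where tool_fn stands for any tool (code execution, web fetch, ...):

  import time
  from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

  def call_tool_with_budget(tool_fn, *args, timeout_s=10.0, max_retries=2):
      """Run one tool call with a hard timeout and a bounded retry
      budget, so a hung dependency fails fast instead of pinning a
      worker and letting the queue snowball."""
      last_error = None
      for attempt in range(max_retries + 1):
          # Fresh single-worker pool per attempt: a hung call from a
          # previous attempt cannot block the retry.
          pool = ThreadPoolExecutor(max_workers=1)
          try:
              return pool.submit(tool_fn, *args).result(timeout=timeout_s)
          except FuturesTimeout:
              last_error = TimeoutError(f"tool timed out after {timeout_s}s")
          except Exception as exc:  # captcha wall, 5xx, sandbox crash, ...
              last_error = exc
          finally:
              pool.shutdown(wait=False)
          time.sleep(min(2 ** attempt, 5))  # brief backoff before retrying
      raise last_error

The same shape applies to the context cache: treat a slow or missing cache entry as a recoverable miss rather than something worth blocking a worker on.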