rvnx 7 hours ago
I think you have to see this as a bunch of stateless requests, and this makes the problem way easier.
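Roughly what I mean by stateless, sketched in Python against an OpenAI-style chat endpoint (the URL and model name here are made up, it's just the shape):

    # Sketch of the "stateless" view: every request carries its full
    # context, so any replica can serve it and scaling out is trivial.
    import requests

    def complete(messages):
        # No server-side session: the whole conversation rides along
        # in the request body every single time.
        resp = requests.post(
            "https://inference.example.com/v1/chat/completions",  # hypothetical endpoint
            json={"model": "some-big-model", "messages": messages},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

Since nothing sticks to a particular server, a load balancer can spray these across as many replicas as you can afford.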
It's very doable: any serious SRE can find a way to set up "larger than one card" models like Kimi or DeepSeek (unquantized) if they have a tightly-coupled HPC cluster, or a pair of very, very beefy servers (see the sketch below). If you run out of servers, that's again a money problem, not an architectural one (and modern datacenters are already scalable). Take the best SRE but give them no budget, and there is no solution. So inference is the easy part.

With Codex or Claude Code, a request that takes a long time or has slow cold-start latency is considered very acceptable; some users would probably not even notice the difference between a request taking 2 minutes versus 3 minutes. The really difficult part is context caching and external tools, because now you are depending on services that might be lagging.
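By "larger than one card" I mean something like this, using vLLM's Python API (argument names from memory, so treat it as a sketch and check the docs, not copy-paste):

    # Rough shape of serving a model that doesn't fit on one GPU:
    # shard within a node (tensor parallel) and across nodes (pipeline parallel).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V3",  # nowhere near fitting on a single card unquantized
        tensor_parallel_size=8,           # split each layer across 8 GPUs in one node
        pipeline_parallel_size=2,         # split the layer stack across 2 nodes
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)

The point is that the hard part is buying the 16 GPUs, not the configuration.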
These are traditional scaling problems, but they are more difficult because all these pieces are fragile and queues can snowball easily.
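One way to keep a queue from snowballing is to bound it and shed load instead of letting wait times grow without limit. A minimal illustration (numbers arbitrary):

    # Bounded admission: reject fast when saturated rather than
    # queueing work that will time out anyway.
    import queue

    requests_q = queue.Queue(maxsize=100)  # hard cap on queued work

    def admit(req):
        try:
            requests_q.put_nowait(req)  # accept only if there's room
            return True
        except queue.Full:
            return False  # a fast "retry later" beats a 10-minute wait

A quick 429 keeps the backlog flat; an unbounded queue is exactly how a small lag in one downstream service turns into the snowball.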