Really!
I've written about this in my comments here...
Here's a brief summary:
- typical thread-per-client programming is terribly wasteful: every client needs a large stack that must be able to grow (within reason), which leads programmers to smear client state all over the stack, so the memory and _cache_ footprint of per-client state is huge even though that state is highly compressible -- this inefficiency is what kills you when you want to C10K (serve 10,000 concurrent clients), and it's what led to the C10K techniques of the 90s
- you can compress said per-client program state the most by using continuation passing style (CPS) async I/O programming, either hand-coded or using modern async functions in languages that have them -- this approach incentivizes the programmer to pack program state into a structure much smaller than an execution stack, which greatly reduces the per-client memory and _cache_ footprint of the program (see the sketch right after this list)
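To make the contrast in those two bullets concrete, here's a minimal asyncio sketch in Python (the echo "protocol" and all names are made up for illustration). The per-client footprint is one coroutine frame plus one small struct, not a thread stack sized for the worst case:

```python
import asyncio
from dataclasses import dataclass

# All per-client state lives in this small struct plus the coroutine's
# frame -- a few hundred bytes -- rather than a growable thread stack.
@dataclass
class ClientState:
    addr: tuple
    bytes_in: int = 0

async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    state = ClientState(addr=writer.get_extra_info("peername"))
    try:
        # Each await is a suspension point: the coroutine yields the CPU
        # without holding a stack, so thousands of these stay cheap.
        while line := await reader.readline():
            state.bytes_in += len(line)
            writer.write(line)          # toy echo "protocol"
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8000)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

Ten thousand of these coroutines occupy megabytes; ten thousand default thread stacks reserve gigabytes of address space and, crucially, scatter hot state across far more cache lines.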
Note that reducing the memory and cache footprint of a program also reduces its memory bandwidth footprint, increases cache locality, and reduces latency. This means you can serve more clients with the same hardware. THIS is the point of C10K techniques.
All other techniques like fibers and green threads sit on the spectrum from hand-coded CPS async I/O programming to thread-per-client sequential programming. You get to pick how efficient your code is going to be.
Now, when you apply C10K techniques you get to have one thread per CPU -- look ma, no context switches! -- which naturally improves latency and efficiency. But there's another neat thing about thread-per-CPU: it becomes easier to manage the overall health of the service, at least per CPU, because you can now see all your connected clients, so you can have eviction policies for connections, and admission policies too. In particular you can set a maximum number of connections, picking maxima that only slightly oversubscribe the hardware's capabilities.
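A rough sketch of the thread-per-CPU shape with a connection cap, again in Python for brevity (MAX_CONNS_PER_LOOP and the port are made-up numbers, SO_REUSEPORT is Linux-specific, and with CPython's GIL you'd really use one process per CPU -- the structure is the same in C or Rust):

```python
import asyncio
import os
import socket
import threading

MAX_CONNS_PER_LOOP = 5000    # made-up cap: slight oversubscription per core

def make_listener(port: int) -> socket.socket:
    # SO_REUSEPORT (Linux) lets every per-CPU loop accept on its own copy
    # of the socket; the kernel spreads incoming connections across them.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(1024)
    return sock

async def serve_on_this_cpu(port: int) -> None:
    conns = 0    # loop-local state: no locks, no cross-CPU cache traffic

    async def handler(reader: asyncio.StreamReader,
                      writer: asyncio.StreamWriter) -> None:
        nonlocal conns
        if conns >= MAX_CONNS_PER_LOOP:    # admission policy: hard cap
            writer.close()
            return
        conns += 1
        try:
            await reader.readline()        # per-client protocol goes here
        finally:
            conns -= 1
            writer.close()

    server = await asyncio.start_server(handler, sock=make_listener(port))
    async with server:
        await server.serve_forever()

def main() -> None:
    # One event-loop thread per CPU; in C you'd also pin each thread to
    # a core, and with CPython's GIL you'd use processes instead.
    for _ in range(os.cpu_count() or 1):
        threading.Thread(target=asyncio.run,
                         args=(serve_on_this_cpu(8000),),
                         daemon=True).start()
    threading.Event().wait()    # park the main thread forever

if __name__ == "__main__":
    main()
```

Because each loop owns its clients outright, the cap and any eviction or admission policy need no locks and no cross-CPU coordination.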
Otherwise [and even if you do the thread-per-CPU thing, though circuit breakers matter less if you do] you must have some way to measure the health of your service, you need to monitor it, and you need your monitor to be able to "break the circuit" by telling your service to start rejecting new work. This is where HTTP status 429 and similar come into play -- they're just a way to express backpressure to clients, though flow control will also do, if you have it available to exercise.

You'll still need to monitor load, latencies, and throughput for thread-per-CPU services, naturally, so you know when you need to add hardware. And of course you'll want to build services you can scale horizontally as much as possible so that adding hardware is easy, though you also need to be able to find and stop pathological clients (dealing with DoS and DDoS almost always requires components external to your services).
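Here's a minimal circuit-breaker sketch (the latency threshold, cooldown, and handler are made-up stand-ins): a monitor records observed latency, trips the breaker when it climbs too high, and the service answers 429 until the cooldown expires.

```python
import time

def do_work(payload: str) -> str:
    return payload.upper()    # stand-in for real request processing

class CircuitBreaker:
    def __init__(self, latency_limit_s: float = 0.5,
                 cooldown_s: float = 5.0) -> None:
        self.latency_limit_s = latency_limit_s
        self.cooldown_s = cooldown_s
        self.open_until = 0.0

    def record_latency(self, seconds: float) -> None:
        if seconds > self.latency_limit_s:    # service looks unhealthy
            self.open_until = time.monotonic() + self.cooldown_s

    def allow(self) -> bool:
        return time.monotonic() >= self.open_until

breaker = CircuitBreaker()

def handle_request(payload: str) -> tuple[int, str]:
    if not breaker.allow():
        # Backpressure: tell the client to go away and retry later.
        return 429, "Too Many Requests"
    start = time.monotonic()
    result = do_work(payload)
    breaker.record_latency(time.monotonic() - start)
    return 200, result
```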
Make sure all middleware can respond appropriately to backpressure, including having their own circuit breakers, and you have a pretty resilient system -- one that under extreme pressure will be able to shed load and continue to function to some degree.
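For the middleware side, a minimal sketch of what "respond appropriately" can mean (the URL and limits are hypothetical): back off exponentially with jitter on 429 instead of hammering the service harder, and eventually give up so the backpressure propagates to your own callers.

```python
import random
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url: str, attempts: int = 5) -> bytes:
    delay = 0.1
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise                   # only 429 means "back off and retry"
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2                  # exponential backoff with jitter
    raise TimeoutError(f"{url} still shedding load after {attempts} attempts")
```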
You'll need to be able to express client priorities (so that you can keep servicing part of your load) and quotas (so that pathological high-priority clients don't DoS you).
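One way to sketch that combination (all the numbers are invented): a priority cutoff you raise under pressure, plus a per-client token bucket so even high-priority clients stay within quota.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float                 # tokens refilled per second (the quota)
    burst: float                # bucket capacity
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.burst    # start with a full bucket

    def try_take(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def admit(client_id: str, priority: int, cutoff: int) -> bool:
    if priority < cutoff:       # shed low-priority work first
        return False
    bucket = buckets.setdefault(client_id,
                                TokenBucket(rate=100.0, burst=200.0))
    return bucket.try_take()    # quota applies even at high priority
```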
There's much more to say, naturally.
BTW, I checked today and LLMs seem to know all this stuff, and more than that, they'll be able to point you to frameworks, blogs, and lots of other docs. That said, if you don't prompt them to tell you about the thread-per-CPU stuff, they won't.
Keep in mind that C10K techniques are expensive in terms of developer time, especially for junior developers.