kijin 4 days ago

> 6. Disabling swap doesn't prevent pathological behaviour at near-OOM, although it's true that having swap may prolong it. Whether the global OOM killer is invoked with or without swap, or was invoked sooner or later, the result is the same: you are left with a system in an unpredictable state. Having no swap doesn't avoid this.

This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM grows with the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear-cut than random latency spikes) and reboot or reprovision the faulty server.

We no longer live in a world where we need to keep a particular server online at all costs. When you have an army of servers, a dead server is preferable to a misbehaving one.

OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.
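A rough sketch of why "Connection refused" is the easier signal: when the process behind a port is dead but the host is up, a TCP probe gets an immediate, unambiguous refusal, while a thrashing host tends to show up only as a timeout (or even as "up", because the kernel can complete the handshake for a stuck process). The function name `probe` and its status labels are hypothetical; this is a minimal illustration, not any particular monitoring system's logic.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP health check (hypothetical status labels)."""
    try:
        # A completed handshake only proves the kernel accepted the connection;
        # a process stuck in thrashing can still look "up" at this level.
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # Nothing is listening: the process is gone. Clear-cut and immediate.
        return "down"
    except socket.timeout:
        # No answer before the deadline: thrashing, congestion, or a dead host.
        return "unknown"
    except OSError:
        # Other network errors (unreachable host, connection reset, ...).
        return "error"
```

Only the "down" case lets a monitor act without waiting out a timeout or guessing at a latency threshold.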

mickeyp 4 days ago

That assumes the OOM killer kills the right thing. It may well choose to kill something ancillary instead, which leaves the program that is actually exhausting memory hanging or misbehaving wildly. The real danger in all of this, swap or no, is the shitty OOM killer in Linux.
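On the wrong-victim problem: Linux does let you bias the OOM killer per process through `/proc/<pid>/oom_score_adj`, which accepts values from -1000 (never kill this process) to 1000 (kill it first). A minimal sketch, assuming Linux; the helper name `set_oom_score_adj` and its `proc_root` parameter (there only so it can be exercised against a scratch directory) are hypothetical.

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000  # the kernel OOM killer never picks this process
OOM_SCORE_ADJ_MAX = 1000   # this process is the preferred victim


def set_oom_score_adj(value: int, pid: str = "self", proc_root: str = "/proc") -> int:
    """Write oom_score_adj for a process and return the value read back.

    Hypothetical helper; proc_root exists only so it can be tested off /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(
            f"oom_score_adj must be in [{OOM_SCORE_ADJ_MIN}, {OOM_SCORE_ADJ_MAX}]"
        )
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    # Raising the value is allowed for any process; lowering it typically
    # requires CAP_SYS_RESOURCE, and the kernel rejects the write otherwise.
    path.write_text(f"{value}\n")
    return int(path.read_text().strip())
```

In practice you would raise the score of the memory-hungry worker you'd rather lose and protect critical helpers with negative values from a privileged launcher; systemd exposes the same knob as `OOMScoreAdjust=` in unit files.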

bawolff 4 days ago

> OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention.

I didn't get that impression. My read was that OP was arguing for user-space process killers, so that the system never reaches the point where it becomes unresponsive due to thrashing.
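A user-space killer needs an early signal of thrashing. One commonly used on Linux (4.20+) is pressure stall information (PSI) in `/proc/pressure/memory`, which systemd-oomd builds on. A minimal sketch of the decision half only, assuming the standard two-line `some`/`full` PSI format; the function names and the 10% threshold are hypothetical choices, and choosing and signalling a victim is left out.

```python
from __future__ import annotations


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text such as the contents of /proc/pressure/memory."""
    stats: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        stats[kind] = {
            key: float(val) for key, val in (f.split("=", 1) for f in fields)
        }
    return stats


def should_intervene(
    stats: dict[str, dict[str, float]],
    threshold_pct: float = 10.0,
    window: str = "avg10",
) -> bool:
    """True if all non-idle tasks were stalled on memory for at least
    threshold_pct percent of the window (the "full" line): the box is thrashing."""
    return stats.get("full", {}).get(window, 0.0) >= threshold_pct


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict[str, dict[str, float]]:
    """Read and parse the live memory PSI file (Linux only)."""
    with open(path) as fh:
        return parse_psi(fh.read())
```

A daemon might poll this every second or so and, once it trips, kill the process with the highest `/proc/<pid>/oom_score` well before the kernel OOM killer would have to step in.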

danw1979 4 days ago

Amen to failing fast. A machine that is responding just enough to keep a circuit breaker closed is the scourge of distributed systems.
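One way to keep a half-alive machine from holding a breaker closed is to count slow responses as failures, not just errors, and to judge a sliding window rather than a run of consecutive failures (which a box that succeeds every other call would never produce). A minimal sketch; the class name, window size, latency cutoff, and failure ratio are all hypothetical, and production breakers also handle half-open probing and timed resets.

```python
from collections import deque


class SlowCallBreaker:
    """Opens when too many recent calls either failed or were too slow.

    Hypothetical sketch: no half-open state or automatic reset.
    """

    def __init__(self, window: int = 20, slow_seconds: float = 1.0,
                 max_bad_ratio: float = 0.5) -> None:
        self.slow_seconds = slow_seconds
        self.max_bad_ratio = max_bad_ratio
        self.recent: deque[bool] = deque(maxlen=window)  # True = bad call
        self.is_open = False

    def record(self, ok: bool, elapsed_seconds: float) -> None:
        # A slow success counts against the backend just like an error.
        bad = (not ok) or elapsed_seconds > self.slow_seconds
        self.recent.append(bad)
        # Only judge a full window, so one early blip cannot trip the breaker.
        if len(self.recent) == self.recent.maxlen:
            if sum(self.recent) / len(self.recent) >= self.max_bad_ratio:
                self.is_open = True

    def allow_request(self) -> bool:
        return not self.is_open
```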

heavyset_go 4 days ago

Maybe I'm just insane, but if I'm on a machine with ample memory and a process for some reason can't allocate resources, I want that process to fail ASAP. The same goes for high memory pressure: just kill the greedy, hungry processes, please. If the system is in that state, something is going very wrong, so I want everything to die immediately.
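If the goal is for one process to fail the moment it can't get memory, a per-process option on Unix is an address-space limit (`RLIMIT_AS`): allocations beyond it fail immediately instead of pushing the whole box into pressure. A minimal sketch using Python's `resource` module, assuming Linux; the helper name and the 1 GiB cap are hypothetical. Note that `RLIMIT_AS` counts virtual address space, so it can trip earlier than resident memory use would suggest.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(code: str, cap_bytes: int) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child whose address space is capped.

    Hypothetical helper for illustration.
    """

    def apply_cap() -> None:
        # Runs in the child between fork and exec; the limit survives exec.
        resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, cap_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_cap,
        capture_output=True,
        text=True,
    )
```

For example, `run_with_memory_cap("bytearray(4 * 1024**3)", 1024**3)` asks for 4 GiB under a 1 GiB cap, so the child dies right away with `MemoryError` rather than slowly dragging the machine into swap.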