| Recognition that older Linux swap strategies were sometimes unhelpful, which this piece of writing provides, validates our past sense that it wasn't working well. Regaining trust takes time. Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things. Also, as a mainly BSD person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD. Ancient model: twice as much swap as memory. Old model: same amount of swap as memory. New model: the amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size. |
| |
| ▲ | jcalvinowens 4 days ago | parent | next [-] | | You can trivially disable overcommit on Linux (vm.overcommit_memory=2) to get allocation failures instead of OOMs. But you will find yourself spending a lot more money on RAM :) | | |
| ▲ | hugo1789 3 days ago | parent [-] | | And debug many tools which still ignore the fact that malloc could fail. |
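A rough illustration of why strict mode (vm.overcommit_memory=2) tends to cost more RAM: in that mode the kernel caps total committed address space at a limit built from swap plus a percentage of RAM (vm.overcommit_ratio, 50 by default). The sketch below only reproduces that arithmetic and ignores huge pages and vm.overcommit_kbytes; the function name and the example sizes are made up for illustration.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate CommitLimit in strict overcommit mode (huge pages ignored).

    ram_kb and swap_kb are sizes in kB; overcommit_ratio is a percentage,
    mirroring the vm.overcommit_ratio sysctl (assumed default: 50).
    """
    return ram_kb * overcommit_ratio // 100 + swap_kb


# Hypothetical machine: 16 GB of RAM, 2 GB of swap, default ratio.
ram_kb = 16_000_000
swap_kb = 2_000_000
limit_kb = commit_limit_kb(ram_kb, swap_kb)
print(f"commit limit: {limit_kb} kB for {ram_kb} kB of RAM")
```

With the default ratio, the allocations that succeed add up to less than the installed RAM on this example machine, which is one way to read the "more money on RAM" remark. Raising the ratio or adding swap raises the limit.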
| |
| ▲ | man8alexd 4 days ago | parent | prev [-] | | I assumed the same, but just discovered that FreeBSD has vm.overcommit too. But I'm not sure about its working. | | |
| ▲ | toast0 3 days ago | parent [-] | | Overcommit is subtle. If you allocate a bunch of address space and don't touch it, that's one thing. If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later. My understanding (which could very well be wrong) is that Linux overcommit will continue to allocate address space when asked regardless of memory pressure, but FreeBSD overcommit will refuse allocations when there's too much memory pressure. I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use; it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use. All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent Linux has a measure that I haven't used), but swap % and swap I/O are easy to measure. If your swap grows quickly, you might not have time to do anything to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap I/O, that provides a measure of urgency. | | |
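The "swap % is easy to measure" point can be sketched on Linux by parsing the SwapTotal and SwapFree fields of /proc/meminfo. This is a minimal sketch, not a monitoring tool: the function name, the sample text and the 20% warning threshold are all assumptions chosen for illustration.

```python
def swap_used_fraction(meminfo_text):
    """Return used swap as a fraction of total swap (0.0 if no swap exists)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # /proc/meminfo reports kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


# Made-up sample in the /proc/meminfo layout: 2 GiB of swap, 1.5 GiB free.
SAMPLE = """MemTotal:       16384000 kB
SwapTotal:       2097152 kB
SwapFree:        1572864 kB
"""

WARN_AT = 0.20  # hypothetical threshold; tune it to your own job mix

used = swap_used_fraction(SAMPLE)
status = "WARN" if used >= WARN_AT else "ok"
print(f"swap used: {used:.0%} ({status})")
```

On a real machine you would pass in the contents of /proc/meminfo instead of SAMPLE and sample it periodically, so that the growth rate is visible as well as the level.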
| ▲ | jcalvinowens 3 days ago | parent [-] | | > If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later. It depends, but generally speaking I'd disagree with that. The only time you actually want to see the allocation failures is if you're writing high reliability software where you've gone to the trouble to guarantee some sort of meaningful forward progress when memory is exhausted. That is VERY VERY hard, and quickly becomes impossible when you have non-trivial library dependencies. If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux, since only root can mmap() the lowest page. Admittedly I'm anal, and I write the explicit code to check for it and call abort(), but I know very experienced programmers I respect who don't. | | |
| ▲ | toast0 3 days ago | parent [-] | | > If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux, only root can mmap() the lowest page. If you don't care to handle the error, which is a totally reasonable position, there's not a whole lot of difference between the allocator returning a pointer that will make you crash on use because it's zero, and a pointer that will make you crash on use because there are no pages available. There is some difference because if you get the allocation while there are no pages available, the fallible allocator has returned a permanently dead pointer and the unfailing allocator has returned a pointer that can work in the future. But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. I certainly agree it's not easy to do much other than abort in most cases, but I'd rather have the opportunity to try. | | |
| ▲ | jcalvinowens 2 days ago | parent [-] | | > But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. It's just inherently incompatible with overcommit, isn't it? Like you can mmap() directly and use MAP_POPULATE|MAP_LOCKED to get what you want*, but that defeats overcommit entirely. I guess I can imagine a syscall that takes a pointer and says "fault this page please but return an error instead of killing me if you can't", but there's an unavoidable TOCTOU problem in that it could be paged out again before you actually touch it. A zany idea is to write a custom malloc() that uses userfaultfd to allow overcommit in userspace with it disabled in the kernel. The benefit being that userspace gets to decide what to do if a fault can't be satisfied instead of getting killed. But that would be pretty complex, and I don't know what the performance would look like. * EDIT: Actually the manpage implies some ambiguity about whether MAP_LOCKED|MAP_POPULATE is guaranteed to avoid the first major fault, it might need mmap()+mlock(), I'd have to look more carefully... | | |
| ▲ | toast0 a day ago | parent [-] | | > It's just inherently incompatible with overcommit, isn't it? It's true that if overcommit is enabled, you can't guarantee you won't end up with a page fault that can't be satisfied. But my experience on FreeBSD, which has overcommit enabled by default and returns NULL when asked for allocations that can't be (currently) satisfied, is that most of the time you get a NULL allocation rather than an unsatisfied page fault. What typically happens is a program grows to use beyond available memory (and swap), and it does so by allocating large but manageable chunks, using them, and then repeating. At a certain point the OS struggles but is typically able to find a page for each fault; the next large allocation looks too big, so the allocation fails and the program aborts. But sometimes a program changes its usage pattern and starts using allocations that had been unused. In that case, you can still trigger the fatal page faults, because overcommit let you allocate more than is there. If you don't want to have both scenarios, you can choose to eliminate the possibility of NULL by strictly allowing all allocations (although you could run out of address space and get a NULL at that point) or you can choose to eliminate the possibility of an unsatisfied page fault by strictly disallowing overcommit. I prefer having NULL when possible, and unsatisfied page faults when not. |
|
|
|
|
|
|
| |
| ▲ | ChocolateGod 3 days ago | parent | next [-] | | I no longer use disk swap for servers, instead opting for Zram with a maximum of 50% of RAM capacity and a high swappiness value. It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device. | | | |
| ▲ | man8alexd 4 days ago | parent | prev [-] | | The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not so large that it causes swap thrashing and a delayed OOM kill if a fast memory leak happens. Another rule of thumb is that performance degradation due to the active working set spilling into the swap is steep: 0.1% excess causes 2x degradation, 1% causes 10x, and 10% causes 100x (assuming a 10^3 difference in latency between RAM and SSD). | | |
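The degradation figures quoted above fall out of a simple weighted-average model. The sketch below assumes accesses are spread evenly over the active working set, so the fraction of pages spilled to swap equals the fraction of accesses that pay swap latency; the function name and the even-access assumption are mine, while the 10^3 latency ratio is the one stated in the comment. Under this model the slowdown grows linearly with the spilled fraction, with a slope of roughly the latency ratio.

```python
def slowdown(spill_fraction, latency_ratio=1000.0):
    """Average access cost relative to an all-in-RAM working set.

    spill_fraction: share of accesses served from swap (0.0 to 1.0).
    latency_ratio: swap latency divided by RAM latency (assumed 10^3).
    """
    return (1.0 - spill_fraction) + spill_fraction * latency_ratio


for pct in (0.1, 1.0, 10.0):
    print(f"{pct:>5}% spilled -> {slowdown(pct / 100):.1f}x slower")
```

This gives roughly 2x, 11x and 101x for the three cases. Real workloads rarely touch pages evenly, so treat it as an order-of-magnitude guide rather than a prediction.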
| ▲ | kijin 3 days ago | parent [-] | | I would approach the issue from the other direction. Start by buying enough RAM to contain the active working set for the foreseeable future. Afterward, you can start experimenting with different swap sizes (swapfiles are easier to resize, and they perform exactly as well as swap partitions!) to see how many inactive anonymous pages you can safely swap out. If you can swap out several gigabytes, that's a bonus! But don't take that for granted. Always be prepared to move everything back into RAM when needed. |
|
|