ggm 4 days ago

Recognizing that older linux swap strategies were sometimes unhelpful, as this piece of writing does, validates our past sense that it wasn't working well. Regaining trust takes time.

Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.

Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.

Ancient model: twice as much swap as memory

Old model: same amount of swap as memory

New model: the amount of swap your experience tells you this job mix needs to manage memory pressure fairly, which is a bit of a tall ask sometimes; basically, pick a number up to memory size.

creshal 4 days ago | parent | next [-]

> Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.

BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.

jcalvinowens 4 days ago | parent | next [-]

You can trivially disable overcommit on Linux (vm.overcommit_memory=2) to get allocation failures instead of OOMs. But you will find yourself spending a lot more money on RAM :)
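
For the curious, a minimal sketch of what that setting buys you (the 64 GiB figure is arbitrary, meant to exceed RAM plus swap on a typical machine): under strict accounting the reservation itself is refused, so malloc() returns NULL up front rather than the process being OOM-killed later when the memory is touched.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t huge = 64ULL << 30;      /* 64 GiB, deliberately oversized */
        void *p = malloc(huge);

        if (!p) {
            printf("malloc(%zu) failed up front under strict accounting\n", huge);
            return 1;
        }
        printf("malloc(%zu) succeeded: the kernel accepted the reservation\n", huge);
        free(p);
        return 0;
    }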

hugo1789 3 days ago | parent [-]

And debug many tools which still ignore the fact that malloc could fail.

man8alexd 4 days ago | parent | prev [-]

I assumed the same, but just discovered that FreeBSD has vm.overcommit too. I'm not sure how it works, though.

toast0 3 days ago | parent [-]

Overcommit is subtle. If you allocate a bunch of address space and don't touch it, that's one thing.

If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.

My understanding (which could very well be wrong) is that Linux overcommit will continue to allocate address space when asked regardless of memory pressure, but FreeBSD overcommit will refuse allocations when there's too much memory pressure.
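
To make the allocate-vs-touch distinction concrete, a rough sketch (sizes are arbitrary): the malloc loop only reserves address space, and under overcommit it can succeed even when RAM plus swap can't back it all; the memset is what actually dirties pages, and that's where an overcommitted system can hit an unsatisfiable fault instead of a clean NULL.

    #include <stdlib.h>
    #include <string.h>

    #define NCHUNKS 16
    #define CHUNK   (256UL << 20)          /* 256 MiB per chunk, arbitrary */

    int main(void)
    {
        void *chunks[NCHUNKS];

        for (int i = 0; i < NCHUNKS; i++) {
            chunks[i] = malloc(CHUNK);      /* reserves address space only */
            if (!chunks[i])
                return 1;                   /* the fallible-allocator outcome */
        }

        for (int i = 0; i < NCHUNKS; i++)
            memset(chunks[i], 1, CHUNK);    /* touching pages forces them to be backed */

        for (int i = 0; i < NCHUNKS; i++)
            free(chunks[i]);
        return 0;
    }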

I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use: it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use.

All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent linux has a measure that I haven't used), but swap % and swap i/o are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap i/o that provides a measure of urgency.
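
(The recent measure is presumably PSI, pressure stall information, which newer kernels expose under /proc/pressure/. A trivial reader, assuming your kernel has it:)

    #include <stdio.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (!f) {
            perror("/proc/pressure/memory");   /* older kernels won't have PSI */
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);               /* "some" and "full" stall averages */
        fclose(f);
        return 0;
    }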

jcalvinowens 3 days ago | parent [-]

> If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.

It depends, but generally speaking I'd disagree with that.

The only time you actually want to see the allocation failures is if you're writing high reliability software where you've gone to the trouble to guarantee some sort of meaningful forward progress when memory is exhausted. That is VERY VERY hard, and quickly becomes impossible when you have non-trivial library dependencies.

If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux; only root can mmap() the lowest page.

Admittedly I'm anal, and I write the explicit code to check for it and call abort(), but I know very experienced programmers I respect who don't.
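
For what it's worth, the explicit version is tiny; a sketch (xmalloc is just a local helper name, nothing standard):

    #include <stdlib.h>
    #include <string.h>

    static void *xmalloc(size_t n)
    {
        void *p = malloc(n);

        if (!p)
            abort();        /* no recovery path: die loudly rather than on a NULL deref */
        return p;
    }

    int main(void)
    {
        char *buf = xmalloc(4096);
        memset(buf, 0, 4096);
        free(buf);
        return 0;
    }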

toast0 3 days ago | parent [-]

> If all you do is raise std::bad_alloc or call abort(), handling NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux; only root can mmap() the lowest page.

If you don't care to handle the error, which is a totally reasonable position, there's not a whole lot of difference between the allocator returning a pointer that will make you crash on use because it's zero, and a pointer that will make you crash on use because there are no pages available. There is some difference because if you get the allocation while there are no pages available, the fallible allocator has returned a permanently dead pointer and the unfailing allocator has returned a pointer that can work in the future.

But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. I certainly agree it's not easy to do much other than abort in most cases, but I'd rather have the opportunity to try.

jcalvinowens 2 days ago | parent [-]

> But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault.

It's just inherently incompatible with overcommit, isn't it? Like you can mmap() directly and use MAP_POPULATE|MAP_LOCKED to get what you want*, but that defeats overcommit entirely.

I guess I can imagine a syscall that takes a pointer and says "fault this page please but return an error instead of killing me if you can't", but there's an unavoidable TOCTOU problem in that it could be paged out again before you actually touch it.

A zany idea is to write a custom malloc() that uses userfaultfd to allow overcommit in userspace with it disabled in the kernel. The benefit being that userspace gets to decide what to do if a fault can't be satisfied instead of getting killed. But that would be pretty complex, and I don't know what the performance would look like.

* EDIT: Actually the manpage implies some ambiguity about whether MAP_LOCKED|MAP_POPULATE is guaranteed to avoid the first major fault, it might need mmap()+mlock(), I'd have to look more carefully...
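
For reference, a sketch of the mmap()+mlock() variant: mlock() is documented to fault the pages in and pin them, so failure surfaces as an error return here rather than a fatal fault later (in practice RLIMIT_MEMLOCK makes this workable mainly for privileged or specially configured processes).

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;     /* 1 GiB, arbitrary */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        if (mlock(p, len) != 0) {   /* ENOMEM here instead of an OOM kill later */
            perror("mlock");
            munmap(p, len);
            return 1;
        }
        /* ... use the memory knowing it is resident and pinned ... */
        munlock(p, len);
        munmap(p, len);
        return 0;
    }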

toast0 a day ago | parent [-]

> It's just inherently incompatible with overcommit, isn't it?

It's true that if overcommit is enabled, you can't guarantee you won't end up with a page fault that can't be satisfied.

But my experience on FreeBSD, which has overcommit enabled by default and returns NULL for allocations that can't currently be satisfied, is that most of the time you get a NULL return rather than an unsatisfied page fault.

What typically happens is that a program grows to use more than available memory (and swap) by allocating large but manageable chunks, using them, and then repeating. At a certain point the OS struggles: it can typically still find a page for each fault, but the next large allocation looks too big, so the allocation fails and the program aborts.

But sometimes a program changes its usage pattern and starts using allocations that had been unused. In that case, you can still trigger the fatal page faults, because overcommit let you allocate more than is there.

If you don't want both scenarios, you can eliminate the possibility of NULL by strictly allowing all allocations (although you could run out of address space and get a NULL at that point), or you can eliminate the possibility of an unsatisfied page fault by strictly disallowing overcommit. I prefer having NULL when possible, and unsatisfied page faults when not.

kijin 4 days ago | parent | prev | next [-]

For modern Linux servers with large amounts of RAM, my rule of thumb is between 1/8 and 1/32 of RAM, depending on what the machine is for.

For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.

ChocolateGod 3 days ago | parent | next [-]

I no longer use disk swap for servers, instead opting for Zram with a maximum of 50% of RAM capacity and a high swappiness value.

It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.

LargoLasskhyfv 3 days ago | parent | next [-]

Lookie lookie! Isn't it spooky?

https://github.com/CachyOS/CachyOS-Settings/blob/master/usr/...

Resulting in https://i.postimg.cc/hP37vvpJ/screenieshottie.png

Good enough...

ChocolateGod 3 days ago | parent [-]

Yeah. I haven't yet figured out how to get zram to apply transparently to containers, though; anything in another memory cgroup will never get compressed unless swap is explicitly exposed to it.

cmurf 3 days ago | parent | prev [-]

zswap

https://docs.kernel.org/admin-guide/mm/zswap.html

The cgroup accounting also now works in zswap.

ChocolateGod 3 days ago | parent [-]

Zswap requires a backing disk swap; Zram does not.

cmurf 3 days ago | parent [-]

The backing disk or file will only be written to if cache eviction on the basis of LRU comes into play, which is fine because that's probably worth the write hit. The likelihood of thrashing, the biggest complaint about disk based swap, is far reduced.

zram based swap isn't free. Its efficiency depends on the compression ratio (and cost).

man8alexd 4 days ago | parent | prev [-]

The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not so large that it causes swap thrashing and a delayed OOM kill if a fast memory leak happens.

Another rule of thumb is that performance degradation from the active working set spilling into swap is severe and grows roughly in proportion to the spill: 0.1% excess causes ~2x degradation, 1% causes ~10x, and 10% causes ~100x (assuming a 10^3 difference in latency between RAM and SSD).
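
Those numbers follow from a simple weighted-latency estimate; assuming a fraction f of working-set accesses spill to a device about 10^3 times slower than RAM:

    slowdown(f) ~= (1 - f) + f * 10^3 ~= 1 + 999 * f

    f = 0.001  ->  ~2x
    f = 0.01   ->  ~11x
    f = 0.1    ->  ~101x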

kijin 3 days ago | parent [-]

I would approach the issue from the other direction. Start by buying enough RAM to contain the active working set for the foreseeable future. Afterward, you can start experimenting with different swap sizes (swapfiles are easier to resize, and they perform exactly as well as swap partitions!) to see how many inactive anonymous pages you can safely swap out. If you can swap out several gigabytes, that's a bonus! But don't take that for granted. Always be prepared to move everything back into RAM when needed.
