anal_reactor 12 hours ago

Can someone explain to me what the problem is? I have very little knowledge of the Linux kernel, but I'm curious. I've tried reading a little, but it's jargon on top of jargon.

alienchow 11 hours ago | parent | next [-]

I'm not familiar with the jargon either, but based on some reading it comes down to how the latest kernel handles process preemption.

Postgres uses spinlocks to protect shared memory during very short critical sections. A spinlock is a loop that keeps retrying to take the lock without ever sleeping, hence "spinning". Previous kernels could run with PREEMPT_NONE, a preemption mode in which the kernel mostly lets a running task finish its work before switching to something else. The latest kernel removed this option and now interrupts processes that are holding a spinlock. So if a process holding a lock gets interrupted, all the other Postgres processes that need the same lock spin in place for much longer, leading to performance degradation.
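To make the "spinning" part concrete, here is a toy sketch in Python. It is not how Postgres does it (Postgres implements its spinlocks in C on top of atomic CPU instructions); `SpinLock` and `run_workers` are names made up for illustration, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set.

```python
import threading


class SpinLock:
    """Toy spinlock: retry a non-blocking acquire in a tight loop, never sleeping."""

    def __init__(self) -> None:
        # Stand-in for the atomic test-and-set word a real spinlock would use.
        self._flag = threading.Lock()

    def acquire(self) -> None:
        # Busy-wait: keep retrying until the holder releases the lock.
        # If the holder is descheduled, every waiter burns CPU in this loop.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self) -> None:
        self._flag.release()


def run_workers(n_threads: int = 4, n_iters: int = 2000) -> int:
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(n_iters):
            lock.acquire()
            try:
                counter += 1  # the short critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = run_workers()
```

The waiters never yield, so the scheme only pays off when the critical section really is a handful of instructions and the holder is not interrupted in the middle of it.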

anal_reactor 10 hours ago | parent [-]

Why does it only appear on arm64 and not x86?

adrian_b 10 hours ago | parent [-]

It was not architecture-related. Not using huge pages also reproduced the regression on x86.

I do not know why using huge pages mitigates the regression, but it may simply be that an application using huge pages takes spinlocks much less often, so the additional delays do not add up to a significant performance reduction.

tux3 9 hours ago | parent [-]

The problem is the spinlock holder being interrupted by a minor fault (you're touching a page of memory for the first time, and the kernel needs to set it up at the point it's actually used).

If your pages are 1GB instead of 4kB, this happens much less often.
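A rough back-of-the-envelope sketch of that difference. The 8 GiB shared region is a made-up figure, not a number from the report, and the 2 MB row is added only because it is the common huge page size on x86-64. The counts are an upper bound on first-touch faults for one process that touches every page of the region once.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover the region (ceiling division)."""
    return -(-region_bytes // page_bytes)


# Hypothetical shared memory size, chosen only for illustration.
region = 8 * GIB

for label, page in (("4 kB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
    print(f"{label:>5} pages: {pages_needed(region, page):>9,} pages to fault in")
```

With these assumptions the region spans 2,097,152 pages at 4 kB, 4,096 at 2 MB, and 8 at 1 GB, so there are far fewer chances for a fault to land inside a critical section.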

tijsvd 10 hours ago | parent | prev [-]

From what I understand from the follow-up: Postgres uses shared memory for buffers. A new connection reads this shared memory while holding a lock.

In Postgres, each connection is handled by a forked process, not a new thread. The first time such a process reads a page of memory, even one that already exists, it takes a minor page fault, which traps into the kernel so it can update the memory mapping tables.

The operation under the lock is only a few instructions, but if it takes longer than expected, that causes lock contention. Is this a regression in how the kernel handles minor faults?

The whole thing is then made worse because it's a spinlock: all waiting processes contend for the CPUs, which adds to kernel processing.

Mitigated by using huge pages, which dramatically reduces the number of mapping entries and faults. I reckon that it could also be mitigated in postgres by pre-faulting all shared memory early?
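For anyone wondering what "pre-faulting" would look like, a minimal sketch: touch one byte in every page up front so the first-touch faults happen before any lock is taken. This only illustrates the idea on an anonymous mapping in a single process; `prefault` is a hypothetical helper, and whether doing something like this in each Postgres backend would be practical is exactly the open question.

```python
import mmap


def prefault(buf: mmap.mmap, page_size: int = mmap.PAGESIZE) -> int:
    """Read one byte from every page so each page is faulted in now, not later."""
    touched = 0
    for offset in range(0, len(buf), page_size):
        _ = buf[offset]  # first access to a page triggers the fault
        touched += 1
    return touched


# Small anonymous mapping standing in for a shared memory region (assumption).
region = mmap.mmap(-1, 64 * mmap.PAGESIZE)
pages_touched = prefault(region)
```

The cost is paid once at startup instead of at an unpredictable moment later, at the price of a slower start for every process that does it.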