alienchow 10 hours ago

I'm not familiar with the jargon either, but based on some reading it comes down to how the latest kernel handles process preemption.

Postgres uses spinlocks to protect shared memory during very short critical sections. A spinlock is a loop that keeps retrying to take the lock without ever sleeping, hence "spinning". Previous kernels could run with PREEMPT_NONE. As far as I can tell this is a kernel-wide preemption model rather than a per-process flag, and under it the kernel is much less eager to interrupt a running task, so a lock holder usually gets to complete its work first. The latest kernel apparently removed this option and now interrupts processes that hold spinlocks. So if a process that is holding a lock gets interrupted, all the other Postgres processes that need the same lock spin in place for much longer, leading to performance degradation.
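
To make the "spinning" part concrete, here is a toy sketch in Python. It is not how Postgres does it (Postgres is C and uses hardware atomic instructions); `SpinLock` and `run_workers` are made-up names, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set.

```python
import threading


class SpinLock:
    """Toy spinlock: retry a non-blocking acquire in a tight loop, never sleeping."""

    def __init__(self):
        # threading.Lock stands in for the atomic test-and-set flag a real
        # spinlock would use (an assumption made for illustration).
        self._flag = threading.Lock()

    def acquire(self):
        # Busy-wait: keep retrying until the flag is free. No sleep, no yield.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()


def run_workers(workers=4, iterations=500):
    """Have several threads bump a shared counter under the spinlock."""
    lock = SpinLock()
    total = 0

    def work():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            try:
                total += 1  # the critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_workers())
```

The point to notice is the `while ... pass` loop: a waiter burns CPU the whole time the holder is not running, which is why a descheduled holder hurts so much.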

anal_reactor 10 hours ago | parent [-]

Why does it only appear on arm64 and not x86?

adrian_b 10 hours ago | parent [-]

It is not architecture-related. The regression also reproduces on x86 when huge pages are not used.

I do not know why using huge pages mitigates the regression, but it could simply be that, when the application uses huge pages, it takes spinlocks much less frequently, so the additional delays do not accumulate enough to cause a significant performance reduction.

tux3 9 hours ago | parent [-]

The problem is the spinlock holder being interrupted by a minor page fault (you're touching a page of memory for the first time, and the kernel needs to set it up on that first use).

If your pages are 1GB instead of 4kB, this happens much less often.
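
A rough back-of-the-envelope in Python shows the scale. The 8 GiB region size is a number I picked for illustration, not something from the report, and `first_touch_faults` is a made-up helper that assumes every page gets touched exactly once.

```python
KIB = 1024
GIB = 1024 ** 3


def first_touch_faults(region_bytes, page_bytes):
    """Upper bound on first-touch minor faults: one per page in the region."""
    return -(-region_bytes // page_bytes)  # ceiling division


if __name__ == "__main__":
    region = 8 * GIB  # hypothetical shared memory size
    small = first_touch_faults(region, 4 * KIB)
    huge = first_touch_faults(region, 1 * GIB)
    print(f"4 kB pages: {small} faults")
    print(f"1 GB pages: {huge} faults")
```

For that region this works out to 2,097,152 possible faults with 4 kB pages versus 8 with 1 GB pages, so there are far fewer chances for a lock holder to get caught in one.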