Twirrim 3 days ago

At OCI, our VM shapes are all single NUMA node by default. We only relatively recently added support for cross-NUMA instances, precisely because of the complications that NUMA introduces.

There are so many performance quirks, and so much software doesn't account for NUMA yet (in part, I'd bet, because most development environments don't have multiple NUMA domains).

Here's a fun example we found a few years ago (not sure if work has happened in the upstream kernel since): the Linux page cache wasn't fully NUMA aware, and spanned NUMA nodes. Someone at work was specifically looking at NUMA performance, and chose to benchmark databases on different NUMA nodes, trying the client on the same NUMA node as the server and then on the other node, using numactl to pin.

After a bunch of tests it looked like client and server both in NUMA 0 was appreciably faster than client and server both in NUMA 1. After a reboot and re-running the tests, it had flipped: NUMA 1 was faster than NUMA 0. Eventually they worked out that the fast NUMA node was whichever one was benchmarked first after a reboot, and from there figured out that on a fresh run, the database client library ended up in the page cache in that NUMA domain.

So if they benchmarked with server in 0, client in 1, and then benchmarked with server in 0, client in 0, that second client's access to the client library ended up reaching across to the page-cached copy in 1, paying a nice latency penalty over and over. Their solution was to run the client in a NUMA-pinned Docker container, so that the library was a unique file to the OS.
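For anyone wanting to reproduce that kind of test, here's a rough sketch of building the numactl-pinned command lines for every server/client placement. The `./db-server` and `./db-client` binaries are placeholders I made up, not what was actually benchmarked, and I'm assuming pinning was done with `--cpunodebind` and `--membind`; the original setup may have used different options.

```python
import shlex


def numactl_cmd(node, command):
    """Return an argv list that runs `command` with CPU and memory bound to NUMA `node`."""
    if node < 0:
        raise ValueError("NUMA node must be non-negative")
    # --cpunodebind restricts which CPUs the process runs on,
    # --membind restricts which node its memory is allocated from.
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


def benchmark_matrix(nodes, server_cmd, client_cmd):
    """Return (server_node, client_node, server_argv, client_argv) for every placement."""
    plans = []
    for server_node in nodes:
        for client_node in nodes:
            plans.append((
                server_node,
                client_node,
                numactl_cmd(server_node, server_cmd),
                numactl_cmd(client_node, client_cmd),
            ))
    return plans


if __name__ == "__main__":
    # Hypothetical binaries, purely for illustration.
    for s, c, server_argv, client_argv in benchmark_matrix(
        [0, 1], ["./db-server"], ["./db-client", "--bench"]
    ):
        print(f"server={s} client={c}")
        print("  " + shlex.join(server_argv))
        print("  " + shlex.join(client_argv))
```

This only prints the commands; it doesn't run anything. Going by the story above, the order you run the placements in matters, since whichever node touches the client library first after a reboot is where its page-cached copy lives, so you'd want to reboot (or otherwise evict the file from the cache) between runs to get comparable numbers.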