bboreham | 3 days ago
Very detailed and accurate description. The author clearly knows way more than I do, but I would venture a few notes:

1. In the cloud, it can be difficult to know the NUMA characteristics of your VMs: AWS, Google, etc. do not publish them. I found the 'lscpu' command helpful (there's a small sysfs-reading sketch after this list).

2. Tools like https://github.com/SoilRos/cpu-latency plot the core-to-core latency on a 2D grid. There are many example visualisations on that page; maybe you can find the chip you are using.

3. If you get to pick VM sizes, pick ones the same size as a NUMA node on the underlying hardware, e.g. prefer the 64-core m8g.16xlarge over the 96-core m8g.24xlarge, which will span two nodes.
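
Something like this little sketch (reading the same Linux sysfs files that lscpu reports, so it works even when the provider doesn't document the topology; Linux guests only) prints each node's CPUs and memory:

    #!/usr/bin/env python3
    # Minimal sketch: enumerate NUMA nodes with their CPUs and memory,
    # straight from sysfs -- the same data lscpu/numactl report.
    from pathlib import Path

    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpus = (node / "cpulist").read_text().strip()
        meminfo = (node / "meminfo").read_text()
        total_kb = next(line.split()[-2] for line in meminfo.splitlines()
                        if "MemTotal" in line)
        print(f"{node.name}: cpus {cpus}, MemTotal {total_kb} kB")

If it prints more than one node, your instance spans NUMA domains.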
Twirrim | 3 days ago
At OCI, our VM shapes are all single NUMA node by default. We only relatively recently added support for cross-NUMA instances, precisely because of the complications NUMA introduces. There are so many performance quirks, and so much software doesn't account for it yet (in part, I'd bet, because most development environments don't have multiple NUMA domains).

Here's a fun example we found a few years ago (not sure if work has happened in the upstream kernel since): the Linux page cache wasn't fully NUMA-aware and spans NUMA nodes. Someone at work was looking specifically at NUMA performance and chose to benchmark databases across NUMA nodes, using numactl to pin the client first on the same node as the server and then on the other node. After a bunch of tests it looked like client and server together on NUMA node 0 were appreciably faster than client and server together on NUMA node 1. After a reboot and re-running the tests, it had flipped: node 1 was faster than node 0.

Eventually they worked out that the fast node was whichever one was benchmarked first after a reboot, and from there figured out that on a fresh boot the database client library ended up in the page cache on that NUMA node. So if they benchmarked with the server in 0 and the client in 1, and then benchmarked with server and client both in 0, the client's access to its client library reached across to the page-cached copy on node 1, paying a nice latency penalty over and over. His solution was to run the client in a NUMA-pinned Docker container, so that the library was a unique file to the OS.
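
For anyone wanting to reproduce that sort of pinning from inside a process rather than wrapping it in numactl, here's a rough sketch (CPU affinity only; the memory-policy half of what numactl does still needs numactl or libnuma, and the node number is just an example):

    #!/usr/bin/env python3
    # Rough sketch: pin the current process to the CPUs of one NUMA node by
    # reading its cpulist from sysfs. Covers only the CPU side of what
    # `numactl --cpunodebind` does; memory policy is not set here.
    import os

    def parse_cpulist(s):
        """Expand a sysfs cpulist like '0-3,8-11' into a set of CPU ids."""
        cpus = set()
        for part in s.strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    node = 0  # example node; pick whichever node you're testing
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
    print("pinned to CPUs:", sorted(os.sched_getaffinity(0)))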
gmokki | 3 days ago
I've used https://instaguide.io/info.html?type=c5a.24xlarge#tab=lstopo to browse the info. It's getting a bit old, though.
jiggawatts | 3 days ago
Many clouds don't guarantee that small instances won't span NUMA nodes. Here on HN I saw a comment by someone running VM scale sets with hundreds or even thousands of nodes; their trick was to overprovision and then delete the instances that spanned NUMA nodes (a quick check for that is sketched below).
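
A sketch of the kind of check that trick implies, assuming you can run something on each freshly provisioned guest (the culling itself would live in whatever orchestration you already have):

    #!/usr/bin/env python3
    # Sketch: exit non-zero if the guest sees more than one NUMA node, so a
    # provisioning script can delete the instance and provision a new one.
    import sys
    from pathlib import Path

    nodes = list(Path("/sys/devices/system/node").glob("node[0-9]*"))
    print(f"{len(nodes)} NUMA node(s) visible")
    sys.exit(1 if len(nodes) > 1 else 0)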
tuananh | 3 days ago
> prefer the 64-core m8g.16xlarge over the 96-core m8g.24xlarge, which will span two nodes

It's sad that we have to do this by ourselves.