| ▲ | stego-tech 3 days ago |
| Solid writeup of NUMA, scheduling, and the need for pinning for folks who don’t spend a lot of time on the IT side of things (where we, unfortunately, have been wrangling with this for over a decade). The long and short of it is that if you’re building an HPC application, or are sensitive to throughput and latency in your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance. One thing the writeup didn’t seem to get into is how poorly manual pinning scales. As core counts and chiplets continue to explode, we still need better ways of scaling manual pinning, or of building more NUMA-aware OSes/applications that can auto-schedule with minimal penalties. Don’t get me wrong, it’s a lot better than ye olden days of dual-core, multi-socket servers and stern vendor warnings against fussing with NUMA schedulers if you wanted to preserve basic functionality, but it’s not a solved problem just yet. |
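To make the "manually pin your workloads" part concrete, here is a minimal, hypothetical sketch (mine, not from the article or this thread) of pinning a worker process to a single NUMA node with libnuma so that both its CPU time and its allocations stay node-local. It assumes Linux with libnuma installed; the node number and buffer size are arbitrary examples.

```c
/* Build (assumption): gcc -o pin_node pin_node.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    int node = 0;                        /* example node; pick per workload */

    /* Restrict this process's CPUs to the chosen node ... */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    /* ... and prefer allocating memory from the same node. */
    numa_set_preferred(node);

    /* An explicitly node-local buffer, so hot data avoids remote accesses. */
    size_t len = 64 * 1024 * 1024;
    void *buf = numa_alloc_onnode(len, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    /* ... NUMA-local work would go here ... */

    numa_free(buf, len);
    return 0;
}
```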
|
| ▲ | colechristensen 3 days ago | parent | next [-] |
| This is one of those way-down-the-road optimizations for folks in fairly rare scale situations with fairly rare tight loops. For most of us, the lowest-hanging fruit is database queries that could be 100x faster and functions that get called a million times a day but only need to be called twice. |
| |
| ▲ | stego-tech 3 days ago | parent | next [-] | | 100% with you there. I can count one time in my entire 15 years when I had to pin a production workload for performance, and it was Hyperion. In 99% of use cases, there are other, easier optimizations to be had. You’ll know if you’re in the 1% of workloads where pinning is advantageous. For everyone else, it’s an excellent explainer of why most guides and documentation sternly warn you against fussing with the NUMA scheduler. | | |
| ▲ | toast0 3 days ago | parent [-] | | > In 99% of use cases, there are other, easier optimizations to be had. You’ll know if you’re in the 1% of workloads where pinning is advantageous. CPU pinning can be super easy too. If you have an application that uses the whole machine, you probably already spawn one thread per CPU thread. Pinning those threads is usually pretty easy; checking whether it makes a difference might be harder... For most applications it won’t make a big difference, but some will see a big one. Usually a positive difference, but it depends on the application. If nobody has tried CPU pinning your application lately, it’s worth trying. Of course, doing something efficiently is nice, but not doing it at all is often a lot faster... Not doing things that don’t need to be done has huge potential speedups. If you want to CPU-pin network sockets, that’s not as easy, but it can also make a big difference in some circumstances; mostly if you’re a load balancer/proxy kind of thing where you don’t spend much time processing packets, just receive and forward. In that case, avoiding cross-CPU reads and writes can provide huge speedups, but it’s not easy. That one, yeah, only do it if you have a good idea it will help; it’s kind of invasive and it won’t be noticeable if you do a lot of work per request. |
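A hedged sketch of the two techniques described above, assuming Linux and glibc. The helper names are made up for illustration, and the socket part assumes a kernel new enough to allow setting SO_INCOMING_CPU (used together with SO_REUSEPORT and one listening socket per worker); nothing here is prescribed by the thread.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* "Super easy" case: pin an already-spawned worker thread to one CPU. */
static int pin_thread_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Harder case: hint that a listening socket "belongs" to one CPU. With
 * SO_REUSEPORT and one socket per worker, traffic whose RX processing runs
 * on that CPU is steered to the matching socket, avoiding cross-CPU reads
 * and writes. Assumes kernel support for setting SO_INCOMING_CPU. */
static int pin_socket_to_cpu(int fd, int cpu)
{
    return setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
}

static void *worker(void *arg)
{
    (void)arg;
    /* ... per-core work loop would go here ... */
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);  /* one worker per CPU thread */
    if (ncpus < 1) ncpus = 1;
    if (ncpus > 1024) ncpus = 1024;
    pthread_t tids[1024];

    for (long i = 0; i < ncpus; i++) {
        if (pthread_create(&tids[i], NULL, worker, NULL) != 0)
            return 1;
        if (pin_thread_to_cpu(tids[i], (int)i) != 0)
            fprintf(stderr, "failed to pin thread %ld\n", i);
    }
    for (long i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);

    (void)pin_socket_to_cpu;  /* socket helper shown for illustration only */
    return 0;
}
```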
| |
| ▲ | bboreham 3 days ago | parent | prev | next [-] | | Whilst you’re right in broad strokes, I would observe that “the garbage collector” is one of those tight loops. Single-threaded JavaScript is perhaps one of the best defences against NUMA, but anyone running a process on multiple cores and multiple gigabytes should at least know about the problem. | |
| ▲ | frollogaston 3 days ago | parent | prev [-] | | Yeah, I was once in this situation with a perf-focused software-defined networking project. Pinning to the wrong NUMA node slowed it down badly. Another likely situation is if you’re working on a DBMS itself. |
|
|
| ▲ | ccgreg 3 days ago | parent | prev | next [-] |
| > The long and short of it is that if you’re building an HPC application, or are sensitive to throughput and latency in your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance. The last time I was the architect of a network chip, 21 years ago, our library did that for the user. For workloads that use threads that consume entire cores, it’s a solved problem. I’d guess that the workload you had in mind doesn’t have that property. |
|
| ▲ | jasonjayr 3 days ago | parent | prev | next [-] |
| This strikes me as something Kubernetes could handle if it supported it. You can already use affinity to ensure workloads stay together on the same machines; if K8s were NUMA-aware, you could extend that affinity/anti-affinity mechanism down to the core/socket level. EDIT: aaaand ... I commented before reading the article, which describes this very mechanism. |
| |
| ▲ | jauntywundrkind 3 days ago | parent [-] | | It'd be great to see Kubernetes make more extensive use of cgroups, and especially nested cgroups, imo. The cpuset affinity should build into that layer nicely. More broadly, Kubernetes' desire to schedule everything itself, to fit workloads intelligently and ensure they run successfully, feels like an anti-pattern when the kernel has a much more aggressive way to let you trade off and define priorities and bound resources; it sucks having the ultra lo-fi Kube take. I want the kernel's "let it fail" version, where nested cgroups get to fight it out according to their allocations. Really enjoyed this amazing write-up on how Kube does use cgroups: it seems like the QoS controls do give some top-level cgroups that pods then nest inside of. That's something, at least! https://martinheinz.dev/blog/91 |
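For readers who haven't poked at this layer, here is a minimal, hypothetical sketch (my illustration, not anything Kubernetes itself ships) of how cpuset affinity is expressed through nested cgroup v2 groups: each group gets a cpuset.cpus/cpuset.mems bound, and children can only narrow it further. The group name is made up; it assumes the v2 hierarchy is mounted at /sys/fs/cgroup, the group directory already exists with the cpuset controller enabled in the parent's cgroup.subtree_control, and sufficient privileges.

```c
#include <stdio.h>
#include <unistd.h>

/* Write a short string into a cgroup control file. */
static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = (fputs(val, f) >= 0 && fflush(f) == 0) ? 0 : -1;
    fclose(f);
    return rc;
}

int main(void)
{
    /* Hypothetical nested group: a "pod" parent with a tighter child. */
    const char *grp = "/sys/fs/cgroup/demo-pod/worker";
    char path[256], pid[32];

    /* Bound this group to CPUs 0-7 and NUMA node 0; child groups can only
     * narrow these sets further. */
    snprintf(path, sizeof(path), "%s/cpuset.cpus", grp);
    if (write_str(path, "0-7") != 0) return 1;
    snprintf(path, sizeof(path), "%s/cpuset.mems", grp);
    if (write_str(path, "0") != 0) return 1;

    /* Move this process into the group so the bounds apply to it and to
     * anything it spawns. */
    snprintf(path, sizeof(path), "%s/cgroup.procs", grp);
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    return write_str(path, pid) != 0 ? 1 : 0;
}
```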
|
|
| ▲ | wmf 3 days ago | parent | prev | next [-] |
| If auto-NUMA doesn’t handle your workload well and you don’t want to manually pin anything, it’s always possible to use single-socket servers and set NPS=1 (one NUMA node per socket). This will make everything uniformly "slow" (which is not that slow). |
| |
| ▲ | ccgreg 3 days ago | parent [-] | | Historically, the Sparc 6400 was derided for not being NUMA, but instead being Uniformly Slow. |
|
|
| ▲ | PerryStyle 3 days ago | parent | prev | next [-] |
| There are some solutions that try to tackle this in HPC. For example, https://github.com/LLNL/mpibind is deployed on El Capitan. It would be interesting to see whether something similar appears for cloud workloads. |
|
| ▲ | uberduper 2 days ago | parent | prev [-] |
| Kubernetes will handle this automatically if you configure the Topology Manager. |