DiabloD3 2 days ago
If you query the NUMA layout tree, you get two sibling hardware threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine). Before the unified 8-core CCX (introduced in Zen 3 and retained in Zen 4, 5, and 6), the Zen 1/1+ and Zen 2 parts would have shown two clusters of four cores instead of one cluster of eight (and a split L3 instead of a unified one). I can't remember whether the split CCXes had their own NUMA layer in the tree or were just iterated in pairs.
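(For anyone who wants to see this tree on their own box, `lstopo` from hwloc prints it directly. A minimal sketch against the hwloc 2.x C API, just counting objects at each level, could look like the following; the level names are hwloc's, not AMD's, and this assumes hwloc 2.x is installed.)

    /* Build: cc topo.c -o topo -lhwloc   (assumes hwloc 2.x) */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void) {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* Count objects at each level of the tree: sockets, NUMA nodes,
         * L3 domains (roughly the CCX/CCD split), cores, hw threads. */
        printf("packages:   %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
        printf("numa nodes: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
        printf("L3 domains: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
        printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
        printf("hw threads: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

        hwloc_topology_destroy(topo);
        return 0;
    }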
fc417fc802 2 days ago | parent
What I find myself wondering about is the performance impact of cross-thread communication in this scenario. With the nested domains there should be distinct (and increasingly severe) costs for crossing each boundary, yet the languages we write in and the programming models we employ don't seem particularly well suited to expressing how code should adapt to those constraints, at least not in a generalized manner. I realize HPC code can be tuned to the specific machine it will run on, but more widely deployed software is going to want to abstract these increasingly complex relationships.
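(A crude way to feel those boundaries is a two-thread cache-line ping-pong with each thread pinned to a different CPU: sibling hyperthreads, same CCX, cross-CCX, cross-socket. A sketch using Linux pthreads and C11 atomics is below; the CPU numbers 0 and 16 are placeholders, pick them from whatever your own topology reports.)

    /* Ping-pong latency between two pinned threads.
     * Build: cc -O2 pingpong.c -o pingpong -lpthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000

    static atomic_int turn = 0;   /* whose move it is: 0 or 1 */

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *bouncer(void *arg) {
        int me = (int)(long)arg;
        /* Placeholder CPU ids: try 0/1 (same core), 0/8 (cross-CCX),
         * 0/16 or 0/64 (cross-die or cross-socket) on your machine. */
        pin_to_cpu(me == 0 ? 0 : 16);
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load_explicit(&turn, memory_order_acquire) != me)
                ;  /* spin until the other thread hands the line back */
            atomic_store_explicit(&turn, 1 - me, memory_order_release);
        }
        return NULL;
    }

    int main(void) {
        struct timespec t0, t1;
        pthread_t a, b;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, bouncer, (void *)0L);
        pthread_create(&b, NULL, bouncer, (void *)1L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg round trip: %.1f ns\n", ns / ROUNDS);
        return 0;
    }

Re-running it with the two threads pinned on opposite sides of each boundary makes the cost of each hop in the tree directly visible.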