Neywiny 2 days ago

32 cores on a die, 256 on a package. Still stunning though

bee_rider 2 days ago | parent [-]

How do people use these things? Map MPI ranks to dies, instead of compute nodes?

wmf 2 days ago | parent | next [-]

Yeah, there's an option to configure one NUMA node per CCD that can speed up some apps.
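Roughly what that looks like from userspace, as an untested sketch (libnuma, link with -lnuma; the node number and the 64 MiB buffer are just placeholders): with the per-CCD option enabled the node count goes up, and you can keep a worker and its memory on one node.

    /* Count the NUMA nodes the kernel exposes, then keep this thread
     * and a buffer resident on a single node. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        printf("kernel exposes %d NUMA node(s)\n", numa_num_configured_nodes());

        numa_run_on_node(0);                           /* pin this thread to node 0 */
        double *buf = numa_alloc_onnode(64 << 20, 0);  /* 64 MiB bound to node 0 */
        if (buf) {
            buf[0] = 1.0;                              /* touch so pages get allocated */
            numa_free(buf, 64 << 20);
        }
        return 0;
    }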

markhahn 2 days ago | parent | prev [-]

MPI is fine, but have you heard of threads?

bee_rider 2 days ago | parent [-]

Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but

* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…

* I’m wondering if explicit communication is better from one die to another in this sort of system (rough MPI sketch below).
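Explicit per-die communication is doable with plain MPI: split the world communicator by shared-memory domain and treat each sub-communicator as "my die", with the launcher mapping one rank (or one rank group) per CCD. MPI_COMM_TYPE_SHARED is standard; splitting by NUMA node directly (e.g. Open MPI's OMPI_COMM_TYPE_NUMA, or its mpirun --map-by option) is implementation-specific, so treat that part as an assumption. A bare sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Ranks that share memory (one node, or one die if the launcher
         * mapped them that way) land in the same sub-communicator. */
        MPI_Comm local_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &local_comm);

        int local_rank, local_size;
        MPI_Comm_rank(local_comm, &local_rank);
        MPI_Comm_size(local_comm, &local_size);

        printf("world rank %d is local rank %d of %d in its shared-memory domain\n",
               world_rank, local_rank, local_size);

        MPI_Comm_free(&local_comm);
        MPI_Finalize();
        return 0;
    }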

fc417fc802 2 days ago | parent [-]

With 2 IO dies aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?

The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.

I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
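One way to check how the firmware groups things is to dump the kernel's NUMA distance matrix (the ACPI SLIT). On a two-IO-die part you might expect two blocks of "near" nodes with a larger distance between the blocks, but that depends entirely on what the BIOS reports, so it's a guess until you run it. Quick libnuma check (link with -lnuma):

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int n = numa_num_configured_nodes();
        for (int i = 0; i < n; i++) {           /* print the n x n distance matrix */
            for (int j = 0; j < n; j++)
                printf("%4d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }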

DiabloD3 2 days ago | parent | next [-]

If you query the NUMA layout tree, you have two sibling hardware threads per core, then a cluster of 8 or 12 physical cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 per machine).

Before the unified 8-core complex (introduced in Zen 3 and retained in 4, 5, and 6), the Zen 1/+ and Zen 2 parts would have shown two clusters of four cores instead of one cluster of eight (and a split L3 instead of a unified one). I can't remember whether the split CCX got its own NUMA layer in the tree or whether the CCXs were just iterated in pairs.
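hwloc will spell the whole tree out for you if you don't want to read sysfs by hand. A minimal count-the-levels sketch (assumes hwloc 2.x, link with -lhwloc); on these parts each L3 should correspond to a CCX, but verify on your own box:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void) {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        printf("packages:   %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
        printf("NUMA nodes: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
        printf("L3 caches:  %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
        printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
        printf("PUs:        %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

        hwloc_topology_destroy(topo);
        return 0;
    }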

fc417fc802 2 days ago | parent [-]

What I find myself wondering about is the performance impact of cross-thread communication in this scenario. With the nested domains it seems like there should be different (and increasingly severe) penalties for crossing each distinct boundary. Meanwhile, the languages we write in and the programming models we employ don't seem particularly well suited to expressing how we want our code to adapt to such constraints, at least not in a generalized manner.

I realize that HPC code can be customized to the specific device it will run on, but more widely deployed software is going to want to abstract these increasingly complex relationships.
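The boundary costs are easy to measure yourself: pin two threads to chosen CPUs and ping-pong an atomic flag between them. A rough sketch (Linux-specific affinity call; the CPU ids come from the command line, so pick two in the same CCX, in different CCDs, or on different sockets to see each boundary's cost):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ROUND_TRIPS 1000000

    static _Alignas(64) atomic_int flag = 0;

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *responder(void *arg) {
        pin_to_cpu(*(int *)arg);
        for (int i = 0; i < ROUND_TRIPS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                                               /* wait for ping */
            atomic_store_explicit(&flag, 0, memory_order_release);  /* pong */
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int cpu_a = argc > 1 ? atoi(argv[1]) : 0;
        int cpu_b = argc > 2 ? atoi(argv[2]) : 1;

        pthread_t t;
        pthread_create(&t, NULL, responder, &cpu_b);
        pin_to_cpu(cpu_a);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUND_TRIPS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);  /* ping */
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                                               /* wait for pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("cpu %d <-> cpu %d: %.1f ns per round trip\n",
               cpu_a, cpu_b, ns / ROUND_TRIPS);
        return 0;
    }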

DiabloD3 2 days ago | parent | next [-]

It's why, if you want high-performance code in this sort of work, you'll want C or C-like code. For example, learn how madvise() is used. Learn how thread-local storage works when implementing it on a hierarchical SMP system. And learn how to build a message-passing system and what "atomic" means (locks are often not your friend here).
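For the message-passing-with-atomics part, the classic building block is a single-producer/single-consumer ring that only uses acquire/release loads and stores, no locks. A bare-bones sketch (C11 atomics; the 1024-slot size and uint64_t payload are arbitrary, and it only works with exactly one producer and one consumer per ring):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RING_SIZE 1024                     /* must be a power of two */

    typedef struct {
        _Alignas(64) atomic_size_t head;       /* advanced by the consumer */
        _Alignas(64) atomic_size_t tail;       /* advanced by the producer */
        _Alignas(64) uint64_t slots[RING_SIZE];
    } spsc_ring;

    /* producer side: returns false if the ring is full */
    static bool ring_push(spsc_ring *r, uint64_t msg) {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;
        r->slots[tail & (RING_SIZE - 1)] = msg;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* consumer side: returns false if the ring is empty */
    static bool ring_pop(spsc_ring *r, uint64_t *msg) {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;
        *msg = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }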

Ironically, a lot of people shoot themselves in the foot by blindly using MPI or OpenMP or any of the other popular industry-supported frameworks, thinking that magically bails them out. It doesn't.

The most important thing, above all else: make sure the problem you're solving can be parallelized, and that CPUs are the right way to do it. Once you've answered that question, and the answer is yes, you can write it pretty much normally.

Also, ironically, you can write Java that isn't shit and takes advantage of systems like these. Sun and the post-Sun community put a lot of work into the HotSpot JVM to make it scale alarmingly well on high-core-count machines. Java used correctly is performant.

bee_rider 2 days ago | parent | prev [-]

Chips and Cheese did some measurements for previous AMD generations; they have a pretty core-to-core latency measurement a little after halfway down the page.

https://chipsandcheese.com/p/genoa-x-server-v-cache-round-2

fweimer a day ago | parent | prev [-]

There should be plenty of existing programming models that can be reused because HPC used single-image multi-hop NUMA systems a lot before the Beowulf clusters took over.

I think very large enterprise systems (where a single kernel runs on a single system that spans multiple racks) are still built like this today.