| ▲ | jauntywundrkind 3 days ago |
| There's a constant drum-beat of NUMA related work going by if you follow phoronix.com . https://www.phoronix.com/news/Linux-6.17-NUMA-Locality-Rando...
https://www.phoronix.com/news/Linux-6.13-Sched_Ext
https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tierin...
https://www.phoronix.com/news/Linux-6.14-FUSE There's some big work I'm missing that's more recent too, again about allocating & scheduling IIRC. Still trying to find it. The third link covers DAMON, which is trying to do a lot of optimization; good thread to tug more on! I have this pocket belief that eventually we might see post-NUMA, post-coherency architectures, where even a single chip acts more like multiple independent clusters that use something more like networking (CXL or UltraEthernet or something) to allow RDMA, but without coherency. Even today, the title here is woefully under-describing the problem. An Epyc chip is actually multiple different compute dies, each with their own NUMA zone and their own L3 and other caches. For now, yes, each socket's memory all goes via a single IO die & is semi-uniform, but whether that holds is in question, and even today the multiple NUMA zones on one socket already require careful tuning for efficient workload processing. |
|
| ▲ | Aurornis 3 days ago | parent | next [-] |
| Emulating NUMA on a single chip is already a known performance tweak on certain architectures. There are options in place to enable it: https://www.kernel.org/doc/html/v5.8/x86/x86_64/fake-numa-fo... Even the Raspberry Pi 5 benefits from NUMA emulation because it makes memory access patterns better match the memory controller’s parallelization capabilities. |
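For reference, the kernel-side knob is a boot parameter. A minimal sketch (the node count and `./my_workload` are placeholders; the right number of fake nodes depends on your memory-controller layout, and `numactl` must be installed):

```shell
# Boot with 4 emulated NUMA nodes (x86_64, requires CONFIG_NUMA_EMU=y).
# Append to the kernel command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX="numa=fake=4"
# then update grub and reboot.

# After reboot, inspect the emulated topology:
numactl --hardware

# Pin a workload to one emulated node and its local memory:
numactl --cpunodebind=0 --membind=0 ./my_workload
```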
|
| ▲ | positron26 3 days ago | parent | prev [-] |
| IMO, it's only a matter of time before an x86 or RISC-V extension shows up to begin the inevitable unification of GPU and SIMD in one ISA. NUMA work and clustering over CCXs and sockets is paving the way for the software support in the OS. The question is: what makes as much of Vulkan, OpenCL, and CUDA go away as possible? |
| |
| ▲ | jauntywundrkind 3 days ago | parent [-] | | The vector-based SIMD of RISC-V is very neat. Very hard, but also very neat. Rather than having fixed instructions for specific cases ("take 4 fp32 and multiply by 3 fp32"), then needing a new instruction for fp64, then a new one for fp32 x fp64, then a new one for 4 x 4, it generalizes the instructions to be more data-shape agnostic: here's a cross-product operation, you tell the hardware what the vector lengths are going to be, and let the hardware figure it out. I also really enjoyed the Stream Semantic Registers paper, which makes load/store implicit in some ops and adds counters that can walk forward and back automatically, so that you can loop immediately, start the next element, and have the results dropped into the next result slot. This enables near-DSP levels of instruction density: more ops-focused, rather than having to spend instructions loading and storing each step. https://www.research-collection.ethz.ch/bitstream/20.500.118... I still have a bit of a hard time seeing how we bridge CPU and GPU. The whole "single program multiple executor" waves aspect of the GPU is spiritually just launching a bunch of tasks for a job, but I still struggle to see an eventual convergence point. The GPU remains a semi-mystical device to me. | | |
| ▲ | jandrewrogers 3 days ago | parent [-] | | The variable length vectors are probably one of those ideas that sound good on paper but don’t work that well in practice. The issue is that you actually do need to know the vector register size in order to properly design and optimize your data structures. Most advanced uses of e.g. AVX-512 are not just doing simple loop-unrolling style parallelism. They are doing non-trivial slicing and dicing of heterogeneous data structures in parallel. There are idioms that allow you to e.g. process unrelated predicates in parallel using vector instructions, effectively MIMD instead of SIMD. It enables use of vector instructions more pervasively than I think people expect but it also means you really need to know where the register boundaries are with respect to your data structures. History has generally shown that when it comes to optimization, explicitness is king. | | |
| ▲ | camel-cdr 3 days ago | parent | next [-] | | > The variable length vectors are probably one of those ideas that sound good on paper but don’t work that well in practice I don't understand this take; you can still query the vector length and have specialized implementations if needed. But the vast majority of cases can be written in a VLA (vector-length-agnostic) way, even most advanced ones, IMO. E.g., here are a few things that I know work well in a VLA style: simdutf (upstream), simdjson (I have a POC), sorting (I would still specialize, but you can have a fast generic fallback), jpeg decoding, heapify, ... | |
| ▲ | positron26 a day ago | parent | prev [-] | | This might be a case where -mtune and -march, or just runtime patching, become more important. |
|
|
|