epistasis | 3 days ago
This is my favorite type of HN post, and definitely going to be a classic in the genre for me.

> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.

In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index over the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large-system performance for it, though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.
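
For anyone unfamiliar, the hot loop is backward search over the BWT. A minimal toy sketch in C (the naive occ() here is just a stand-in for the sampled occurrence tables real aligners use; the point is the access pattern, not the data structure):

    /* occ(c, pos): number of occurrences of c in bwt[0..pos) */
    #include <stdio.h>
    #include <string.h>

    static int occ(const char *bwt, char c, int pos)
    {
        int n = 0;
        for (int i = 0; i < pos; i++)
            n += (bwt[i] == c);
        return n;
    }

    /* Count occurrences of query in the text the BWT was built from. */
    static int fm_count(const char *bwt, const int *C, const char *query)
    {
        int n = (int)strlen(bwt);
        int lo = 0, hi = n;
        for (int i = (int)strlen(query) - 1; i >= 0 && lo < hi; i--) {
            char c = query[i];
            /* Two rank lookups per character, at effectively random positions. */
            lo = C[(unsigned char)c] + occ(bwt, c, lo);
            hi = C[(unsigned char)c] + occ(bwt, c, hi);
        }
        return hi > lo ? hi - lo : 0;
    }

    int main(void)
    {
        /* BWT of "banana$" (with '$' sorting first) is "annb$aa". */
        const char *bwt = "annb$aa";
        int C[256] = {0};
        C['a'] = 1; C['b'] = 4; C['n'] = 5;  /* cumulative counts of smaller chars */
        printf("'ana' occurs %d times\n", fm_count(bwt, C, "ana"));  /* prints 2 */
        return 0;
    }

Each lo/hi update depends on the previous one, so prefetching can't hide much of the latency; that is why memory channels and latency matter so much for this workload.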
bob1029 | 3 days ago
The ideal arrangement is one in which you do not need to use the memory subsystem in the first place. If two threads need to communicate back and forth with each other in a very tight loop to get some kind of job done, there is almost certainly a much faster technique that could be run on a single thread. Physically moving the information between cores is the most expensive part.

You can totally saturate the memory bandwidth of a Zen chip with somewhere around 8-10 cores if they're all going at a shared working set really aggressively. Core-to-core communication across Infinity Fabric is on the order of 50~100x slower than L1 access.

Figuring out how to arrange your problem to meet this reality is the quickest path to success if you intend to leverage this kind of hardware. Recognizing that your problem is incompatible can also save you a lot of frustration. If your working sets must be massive, monolithic, and hierarchical in nature, it's unlikely you will be able to use a 256+ core monster part very effectively.
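
A minimal sketch of what I mean by arranging the problem to fit, assuming a simple reduction in plain C with pthreads (thread count and sizes are illustrative): each thread owns its own slice and keeps its accumulator local, so nothing bounces between cores until the final merge.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 8   /* illustrative; pick per socket/CCX in practice */

    typedef struct {
        const uint64_t *data;
        size_t begin, end;
        uint64_t local_sum;          /* per-thread result, merged once at the end */
    } shard;

    static void *worker(void *arg)
    {
        shard *s = arg;
        uint64_t acc = 0;            /* accumulate locally: no cross-core traffic */
        for (size_t i = s->begin; i < s->end; i++)
            acc += s->data[i];
        s->local_sum = acc;
        return NULL;
    }

    int main(void)
    {
        size_t n = (size_t)1 << 24;
        uint64_t *data = calloc(n, sizeof *data);
        pthread_t tid[NTHREADS];
        shard sh[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            sh[t] = (shard){ data, n * t / NTHREADS, n * (t + 1) / NTHREADS, 0 };
            pthread_create(&tid[t], NULL, worker, &sh[t]);
        }
        uint64_t total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += sh[t].local_sum;    /* the only cross-core data movement */
        }
        printf("%llu\n", (unsigned long long)total);
        free(data);
        return 0;
    }

The join loop at the end is the only place where data crosses cores; everything else stays in each core's private cache.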
ashvardanian | 3 days ago
My expectation is that they will perform great! I’m now mostly benchmarking on 192-core Intel, AMD, and Arm instances on AWS, and in some cases they come surprisingly close to GPUs even on GPU-friendly workloads, once you get the SIMD and NUMA pinning parts right. For bioinformatics specifically, I’ve just finished benchmarking Intel SPR 16-core UMA slices against an Nvidia H100, and will try to extend those results soon: https://github.com/ashvardanian/StringWa.rs
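
To be concrete about the pinning part, here is a minimal Linux sketch, assuming a hard-coded CPU-to-node mapping purely for illustration (real runs should query the topology via /sys or libnuma, or just wrap the process in numactl):

    /* Pin the thread, then rely on first-touch so pages land on the local node. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof set, &set);  /* 0 = calling thread */
    }

    int main(void)
    {
        pin_to_cpu(0);                        /* assume CPU 0 lives on node 0 */
        size_t n = (size_t)1 << 24;
        double *buf = malloc(n * sizeof *buf);
        memset(buf, 0, n * sizeof *buf);      /* first touch: pages placed near CPU 0 */
        /* ... run the benchmark kernel on buf from this pinned thread ... */
        free(buf);
        return 0;
    }

The memset is doing real work there: with the default first-touch policy, pages end up on the node of the thread that touches them first, which is what keeps the later accesses local.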