tucnak a day ago

It's entirely dependent on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, caches, etc. The paper in OP demonstrates how relatively subtle differences in the memory model lead to substantial differences in performance on actual hardware. In the same way, having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute if you're waiting on memory all the time. M-series processors have a packaging advantage that is very hard to beat, and indeed has yet to be beaten in the consumer and prosumer segments.
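A tiny sketch of that point (the sizes and names here are my own illustration, not from the paper): the same arithmetic over the same data can run at very different speeds purely because of access order, since sequential visits let caches and prefetchers help while a shuffled order defeats them.

```python
import random
import time

def timed_sum(data, indices):
    """Sum data in the given visiting order; return (seconds, total)."""
    t0 = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return time.perf_counter() - t0, total

n = 1_000_000
data = list(range(n))

sequential = list(range(n))
shuffled = sequential[:]
random.shuffle(shuffled)

# Identical work either way; only the memory access pattern differs.
t_seq, s1 = timed_sum(data, sequential)
t_rnd, s2 = timed_sum(data, shuffled)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rnd:.3f}s")
```

The shuffled pass typically comes out slower even in an interpreted language, because each lookup lands on a cold cache line; the effect only grows with thread count and contention.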

See my reply to the adjacent comment; hardware is not marketing, and LLM inference bears this out.

Rohansi a day ago | parent [-]

> The same as having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute, if you're waiting on memory all the time.

The opposite case is also possible: you can be compute limited, or the bottleneck can be somewhere else entirely. This is definitely the case for Apple Silicon, because you will certainly not be able to use all of the advertised memory bandwidth from the CPU or GPU alone. As always, benchmark instead of reading raw hardware specifications.
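A minimal sketch of that advice (the buffer size and the read-plus-write traffic accounting are my own assumptions, not vendor numbers): time a large buffer copy and compare the achieved figure against the spec-sheet bandwidth. The gap between the two is exactly what benchmarking reveals and datasheets hide.

```python
import time

def measured_bandwidth_gbs(n_bytes=256 * 1024 * 1024, reps=5):
    """Rough estimate of achieved memory bandwidth via buffer copies.

    Each copy reads and writes n_bytes once, so we count 2 * n_bytes of
    traffic per rep and keep the best (least-disturbed) timing. This is
    a sketch, not a tuned STREAM-style benchmark.
    """
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full read + one full write of the buffer
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == n_bytes
    return (2 * n_bytes) / best / 1e9  # GB/s

print(f"achieved copy bandwidth: {measured_bandwidth_gbs():.1f} GB/s")
```

On a machine advertising hundreds of GB/s of unified-memory bandwidth, a single CPU thread will typically report far less, which is the commenter's point: one core, or even all cores, cannot saturate the package's peak figure.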