Rohansi a day ago
Memory bandwidth is just a marketing term for Apple at this point. Sure, the bus is capable of reaching that bandwidth, but how much can your code actually use? You'd be mistaken if you think the CPU, or even the GPU, can make use of all that bandwidth!
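
For context, the gap between the spec-sheet figure and what a loop actually achieves is easy to measure. Below is a minimal STREAM-triad-style sketch in C, assuming OpenMP is available; the array size and scalar are illustrative, not tuned for any particular machine. Build with something like cc -O2 -fopenmp and vary OMP_NUM_THREADS.

    /* Minimal STREAM-triad-style bandwidth check: reports what the timed
     * loop achieved, not the spec-sheet figure. Sizes are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 26)                 /* 64M doubles, 512 MiB per array */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        /* Touch every page first so the timed loop measures DRAM traffic,
         * not page faults. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];    /* triad: two reads, one write */
        double t1 = omp_get_wtime();

        /* Three arrays moved per iteration (ignoring write-allocate traffic). */
        printf("%.1f GB/s\n", 3.0 * N * sizeof(double) / (t1 - t0) / 1e9);

        free(a); free(b); free(c);
        return 0;
    }

In practice the single-thread number tends to land well below the advertised package figure, and the multi-threaded ceiling depends on where the threads land, which is roughly the disagreement in this thread.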
inkyoto a day ago
> […] but how much can your code actually use?

All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?» Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.
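
If you want to sanity-check the string-scanning half of that claim on a single core, a rough sketch is below; the 1 GiB buffer and the sentinel byte are arbitrary choices, and it times nothing more exotic than memchr over a buffer far larger than any cache.

    /* Rough single-core check of the "scanning for a lone character" case:
     * time memchr over a 1 GiB buffer with the target byte at the end. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        size_t n = (size_t)1 << 30;          /* 1 GiB */
        char *buf = malloc(n);
        if (!buf) return 1;
        memset(buf, 'a', n);
        buf[n - 1] = 'z';                    /* the lone character, at the end */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        char *hit = memchr(buf, 'z', n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("hit at %zu, %.1f GB/s\n", (size_t)(hit - buf), n / secs / 1e9);
        free(buf);
        return 0;
    }

A single core streaming data this way usually gets a respectable fraction of the available bandwidth, but whether it reaches the headline figure depends on the machine, which is exactly what the parent comment disputes.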
tucnak a day ago
It's solely dependent on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, caches, etc. The paper in the OP demonstrates how relatively subtle differences in the memory model lead to substantial differences in performance on actual hardware. It's the same as having lots of FLOPS on paper: it doesn't necessarily mean you'll get to use all that compute if you're waiting on memory all the time. M-series processors have a packaging advantage that is very hard to beat, and indeed has yet to be beaten in the consumer and prosumer segments. See my reply to the adjacent comment; the hardware is not marketing, and LLM inference is the proof.
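
To make the access-pattern point concrete, here is a small single-threaded sketch; the buffer size and the stride of 16 longs are arbitrary. Both passes read every element exactly once, but the strided pass makes poor use of cache lines and prefetching, so the useful bandwidth it reports drops sharply.

    /* Sequential vs. strided reads over the same buffer: both touch every
     * element once, but the strided pass wastes most of each cache line. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64UL * 1024 * 1024)           /* 64M longs, 512 MiB */

    static volatile long keep;               /* defeats dead-code elimination */

    static double bench(const long *a, size_t stride) {
        struct timespec t0, t1;
        long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < stride; s++)
            for (size_t i = s; i < N; i += stride)
                sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        keep = sum;
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return N * sizeof(long) / secs / 1e9;   /* GB/s of useful data */
    }

    int main(void) {
        long *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = (long)i;   /* fault pages in */

        printf("sequential reads:   %.1f GB/s\n", bench(a, 1));
        printf("stride of 16 longs: %.1f GB/s\n", bench(a, 16));
        free(a);
        return 0;
    }

DRAM moves whole cache lines either way, so the strided pass pays for bytes it never uses; that, plus prefetchers giving up, is the gap between the rated bandwidth and what a given workload actually sees.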