Rohansi a day ago

Memory bandwidth is just a marketing term for Apple at this point. Sure, the bus is capable of reaching that bandwidth, but how much can your code actually use? You'd be mistaken if you think the CPU can make use of all that bandwidth, or even the GPU!

inkyoto a day ago | parent | next [-]

> […] but how much can your code actually use?

All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?»

Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.
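For instance, the rope-scan case takes a dozen lines to time yourself (a rough sketch; the 4 GiB buffer and plain memchr are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        size_t n = 1UL << 32;            /* 4 GiB, well past any cache */
        char *buf = malloc(n);
        if (!buf) return 1;
        memset(buf, 'a', n);             /* touch every page before timing */
        buf[n - 1] = 'x';                /* the lone character, at the far end */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        char *hit = memchr(buf, 'x', n); /* one sequential scan of the buffer */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("found at %zu: %.1f GB/s\n", (size_t)(hit - buf), n / secs / 1e9);
        free(buf);
        return 0;
    }

Whatever it prints is the bandwidth the code actually saw, with no tuning involved.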

Rohansi a day ago | parent | next [-]

Have you tested it, or is that just what you expect?

tucnak a day ago | parent | prev [-]

Are you familiar enough with the platform to attest that it requires no manual code optimisation for high-performance datapaths? I'm only familiar with the Apple Silicon-specific code in llama.cpp, and not really with either Accelerate[0] or MLX[1] specifically. Have they really cracked homogeneous computing, so that you can write a single description of a computation and have it emit efficient code for whatever target in the SoC (see the caller-side sketch below)? Or are you merely referring to the full memory capacity/bandwidth being available to the CPU in normal operation?

[0]: https://developer.apple.com/documentation/accelerate

[1]: https://ml-explore.github.io/mlx/build/html/usage/quick_star...
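To be clear about what I mean by a single description: the caller-side view of Accelerate really is just plain BLAS (a minimal sketch; the size is arbitrary and I'm assuming the stock cblas interface):

    /* clang matmul.c -framework Accelerate */
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 2048;
        float *a = malloc(sizeof(float) * n * n);
        float *b = malloc(sizeof(float) * n * n);
        float *c = malloc(sizeof(float) * n * n);
        if (!a || !b || !c) return 1;
        for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* C = alpha*A*B + beta*C; how this call is scheduled on the
           SoC is entirely up to the library. */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);

        printf("c[0] = %f\n", c[0]);     /* expect 2*n = 4096 */
        free(a); free(b); free(c);
        return 0;
    }

Where exactly that call lands inside the SoC is the part I can't attest to.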

tucnak a day ago | parent | prev [-]

It depends entirely on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, caches, etc. The paper in the OP demonstrates how relatively subtle differences in the memory model lead to substantial differences in performance on actual hardware. It's the same as compute: having lots of FLOPS on paper doesn't necessarily mean you'll get to use them all if you're waiting on memory all the time. M-series processors have a packaging advantage that is very hard to beat, and indeed has yet to be beaten in the consumer and prosumer segments.
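A crude way to see the contention effect is to sweep the thread count over a saturating read loop (a sketch with pthreads; the 1 GiB-per-thread split is arbitrary):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define CHUNK (1UL << 30)            /* 1 GiB per thread */

    static void *scan(void *arg) {
        const uint64_t *p = arg;
        uint64_t sum = 0;
        for (size_t i = 0; i < CHUNK / 8; i++)
            sum += p[i];                 /* sequential read of this thread's chunk */
        return (void *)(uintptr_t)sum;   /* keep the loop from being optimised out */
    }

    int main(void) {
        for (int threads = 1; threads <= 8; threads *= 2) {
            uint8_t *buf = malloc(CHUNK * threads);
            if (!buf) return 1;
            memset(buf, 1, CHUNK * threads);

            pthread_t tid[8];
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int t = 0; t < threads; t++)
                pthread_create(&tid[t], NULL, scan, buf + (size_t)t * CHUNK);
            for (int t = 0; t < threads; t++)
                pthread_join(tid[t], NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%d thread(s): %.1f GB/s aggregate\n",
                   threads, (double)CHUNK * threads / secs / 1e9);
            free(buf);
        }
        return 0;
    }

The aggregate typically climbs and then flattens somewhere short of the spec-sheet number, depending on which part you're on.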

See my reply to the adjacent comment; hardware is not marketing, and LLM inference bears witness to that.

Rohansi a day ago | parent [-]

> It's the same as compute: having lots of FLOPS on paper doesn't necessarily mean you'll get to use them all if you're waiting on memory all the time.

The opposite is also possible: you can be compute limited, or bottlenecked somewhere else entirely. This is definitely the case for Apple Silicon, because you will certainly not be able to use all of that memory bandwidth from the CPU, or even from the GPU. As always, benchmark instead of reading raw hardware specifications.