menaerus 2 days ago

> Apple Silicon beats it by a factor of 5x

Really, 1TB/s of memory bandwidth to and from system memory?

I don't believe it, since that's impossible from a HW-limits point of view: there's no DRAM that would allow such performance, and Apple doesn't design its own memory sticks ...

Their 512-, 768- or 1024-bit memory interface is also nothing special, since it is neither designed by them nor exclusive to Apple. Intel has it. AMD has it as well.

However, regardless of that, and regardless of how you're skewing the facts, I would be happy to see a benchmark that shows, for example, a sustained load bandwidth of 1TB/s. Do you have one? I couldn't find any.

> You can get somewhat better with Turin

High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700 GB/s. So not "somewhat better" but 3x better.
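
To be concrete about what I mean by "sustained load bandwidth": a minimal single-threaded sketch along the lines below (a numpy array copy; one thread generally won't saturate a multi-channel memory system, so a proper multi-threaded STREAM run is what you'd actually want to see):

    import time
    import numpy as np

    N = 1 << 28                           # 2^28 float64s = 2 GiB per buffer
    src = np.ones(N, dtype=np.float64)
    dst = np.empty_like(src)

    reps = 5
    t0 = time.perf_counter()
    for _ in range(reps):
        np.copyto(dst, src)               # streams 2 GiB of reads + 2 GiB of writes
    t1 = time.perf_counter()

    moved = reps * 2 * src.nbytes         # count both the read and the write
    print(f"{moved / (t1 - t0) / 1e9:.0f} GB/s sustained copy bandwidth")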

namibj a day ago

Power10 has offered that much for a while now, per socket. And you can join up to (iirc) 16 sockets together into a coherent single-Linux-kernel machine.

menaerus a day ago

Not sure which part of my comment you were referring to, but if it was the 1TB/s of memory bandwidth, it seems it is rather 409GB/s per socket.

From https://www.ibm.com/support/pages/ibm-aix-power10-performanc...

> The Power10 processor technology introduces the new OMI DIMMs to access main memory. This allows for increased memory bandwidth of 409 GB/s per socket. The 16 available high-speed OMI links are driven by 8 on-chip memory controller units (MCUs), providing a total aggregated bandwidth of up to 409 GBps per SCM. Compared to the Power9 processor-based technology capability, this represents a 78% increase in memory bandwidth.

And that is again a theoretical limit, which usually isn't that interesting; what matters is the practical limit the CPU is actually able to hit.
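
For what it's worth, if the quoted aggregate splits evenly across the links, the per-link figure works out to exactly one DDR4-3200 channel's worth; that's my inference from the quoted numbers, not an IBM spec:

    links = 16
    aggregate_gb_s = 409.6            # IBM quotes "up to 409 GB/s", presumably rounded
    print(aggregate_gb_s / links)     # -> 25.6 GB/s per OMI link (= DDR4-3200)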

tucnak 2 days ago

You're right, I looked it up: the hardware limit is actually 800 GB/s for the M2 Ultra. You're also right that the actual bandwidth in real workloads is typically lower than that, due to the aforementioned idiosyncrasies in caches, message passing, prefetching or the lack thereof, etc. The same is the case for any high-end Intel/AMD CPU, though. If you wish to compare benchmarks, the single most relevant benchmark today is LLM inference, where M-series chips are the contender to beat. This is almost entirely due to the combination of high-bandwidth, high-capacity (192 GB) on-package DRAM available to all CPU and GPU cores. The closest x86 contender is AMD's Strix Halo, and it's only somewhat competitive in high-sparsity, small-MoE setups. NVIDIA were going to produce a desktop machine based on their Grace superchip, but it turned out to be a big nothing.
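(The reason bandwidth is the single most relevant number here: each decoded token has to stream the entire set of weights from memory once, so bandwidth divided by model size puts a hard ceiling on tokens/s. A back-of-envelope sketch, with illustrative peak-bandwidth and model-size figures:)

    # Decode ceiling: tokens/s <= memory bandwidth / bytes of weights per token.
    # The bandwidth numbers below are approximate peak figures, not measurements.
    def max_tokens_per_s(bandwidth_gb_s, model_size_gb):
        return bandwidth_gb_s / model_size_gb

    model_gb = 40                                  # e.g. a ~70B model at 4-bit
    for name, bw in [("M2 Ultra", 800.0), ("Strix Halo", 256.0)]:
        print(f"{name}: <= {max_tokens_per_s(bw, model_gb):.0f} tok/s")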

Now, I'm not sure whether it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible, considering at that point you're talking about a 5K-euro CPU with a smidge under 600W TDP. This is why I brought up Sienna specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum; it would still have zero GPU cores, and it would still have a little less memory bandwidth than a three-year-old M2 Ultra.

I own both a Mac Studio and a Sienna-based AMD system.

There are valid reasons to go for x86: mainly its PCIe lanes, various accelerator cards, MCIO connectivity for NVMe, hardware IOMMU, SR-IOV networking and storage; in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed, this is why I used Sienna for the comparison, as it's at least comparable in terms of price, and not some abstract idea of redline performance, where x86 CPUs, by the way, absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice is not even whether you're getting a CPU; it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.

Update: I was unfair in my characterisation of NVIDIA DGX Spark as "big nothing," as despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always use a ConnectX-6 in your normal server's PCIe 5.0 slot, but that would already set you back many thousands of euros for datacenter-grade server specs.

rowanG077 2 days ago

> Really, 1TB/s of memory bandwidth to and from system memory?

5x is false; it's more like 4x. And Apple doesn't use memory sticks; they use DRAM ICs mounted on the SoC package.

The M3 Ultra has 8 memory channels at 128 bits per channel, for a total 1024-bit memory bus. It uses LPDDR5-6400, so it has 1024 bits × 6400 MT/s ÷ 8 bits/byte = 819.2 GB/s of memory bandwidth.
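
Spelled out, the same formula also covers the 12-channel DDR5-6400 server config mentioned upthread:

    def peak_gb_s(bus_width_bits, transfer_rate_mt_s):
        # bits per transfer * million transfers per second / 8 bits per byte / 1000
        return bus_width_bits * transfer_rate_mt_s / 8 / 1000

    print(peak_gb_s(1024, 6400))      # M3 Ultra, LPDDR5-6400  -> 819.2 GB/s
    print(peak_gb_s(12 * 64, 6400))   # 12ch DDR5-6400 server  -> 614.4 GB/s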

menaerus 2 days ago

You're deceiving yourself and falling for Apple marketing. Regardless of whether the memory is on a stick or on the SoC package, which has been the case with pretty much every SoC in the 2010s (nowadays I have no idea), it is not possible to drive the memory at such high speeds.

rowanG077 2 days ago

This is definitely [citation needed]. I very much expect a combined GPU/CPU/NPU load to saturate the memory channels if necessary. This is not some marketing fluff: the channels are real, and the RAM ICs are physically there and connected.

menaerus a day ago

We are talking about the memory bandwidth available to the CPU cores, not to all the co-processors/accelerators present in the SoC, so you're pulling in an argument that isn't valid.

https://web.archive.org/web/20240902200818/https://www.anand...

> While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of.

> That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth.

> Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters)

inkyoto a day ago

The cited article is pretty clear: the M1 Max maxes out at (approximately) 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.

They did not (or, rather, could not) measure the theoretical peak GPU saturation for the M1 Max SoC, because such benchmarks did not exist at the time due to the sheer novelty of such wide hardware.

menaerus a day ago

> The cited article is pretty clear: the M1 Max maxes out at (approximately) 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.

So, which part of "We are talking about the memory bandwidth available to the CPU cores, not to all the co-processors/accelerators present in the SoC" did you not understand?

rowanG077 a day ago

Well, I think we were talking about memory channels and the maximum reachable speed, and you claimed it was marketing fluff. I don't think it's unreasonable to say that if that speed is reachable by some workload, then it's not marketing fluff. It was not clear to me at all that you had limited your claims to CPU bandwidth only. Seems like a classic motte-and-bailey to me.

Rohansi a day ago

https://web.archive.org/web/20250125040351/anandtech.com/sho...

You're realistically going to hit power/thermal limits before you saturate the memory bandwidth. Otherwise, I'd like to hear about a workload that uses the CPU, GPU, NPU, etc. simultaneously to hit Apple's marketing number.