menaerus 2 days ago
> Apple Silicon beats it by a factor of 5x

Really, 1 TB/s of memory bandwidth to and from system memory? I don't believe it, since that's impossible from a hardware-limits point of view: there is no DRAM that would allow such performance, and Apple doesn't design their own memory sticks ... Nor is there anything special about their 512-, 768- or 1024-bit memory interface, since that isn't designed by them either, nor is it exclusive to Apple. Intel has it. AMD has it as well. Regardless of that, and regardless of how you're skewing the facts, I would be happy to see a benchmark showing, for example, a sustained load bandwidth of 1 TB/s. Do you have one? I couldn't find it.

> You can get somewhat better with Turin

High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700 GB/s. So not somewhat better, but 3x better.
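(For what it's worth, a minimal single-threaded NumPy sketch along these lines is one rough way to probe sustained copy bandwidth on any of these machines; the array size and method are just assumptions for illustration, and it only gives a lower bound, well under any spec-sheet peak:)

    import time
    import numpy as np

    N = 200_000_000                  # ~1.6 GB per float64 array, far beyond any cache
    a = np.zeros(N)
    b = np.random.rand(N)

    best = 0.0
    for _ in range(5):               # take the best of a few runs
        t0 = time.perf_counter()
        np.copyto(a, b)              # streams b in and a out: 2 * N * 8 bytes
        dt = time.perf_counter() - t0
        best = max(best, 2 * N * 8 / dt / 1e9)
    print(f"~{best:.0f} GB/s sustained copy bandwidth")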
namibj a day ago
POWER10 has offered that much for a while now, per socket. And you can join up to (IIRC) 16 sockets together into a coherent single-Linux-kernel machine.
tucnak 2 days ago
You're right, I looked it up: the hardware limit is actually 800 GB/s for the M2 Ultra. You're also right that the actual bandwidth in real workloads is typically lower than that, due to the aforementioned idiosyncrasies in caches, message-passing, prefetches or the lack thereof, etc. The same is true for any high-end Intel/AMD CPU, though.

If you wish to compare benchmarks, the single most relevant benchmark today is LLM inference, where M-series chips are the contender to beat. This is almost entirely due to the combination of high-bandwidth, high-capacity (192 GB) on-package DRAM available to all CPU and GPU cores. The closest x86 contender is AMD Strix Halo, and it's only somewhat competitive in high-sparsity, small-MoE setups. NVIDIA was going to produce a desktop part based on its Grace superchip, but it turned out to be a big nothing.

Now, I'm not sure it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible (quick math at the end of this comment), considering that at this point you're talking about a 5K-euro CPU with a smidge under 600 W TDP. This is why I brought up Siena specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum, would still have zero GPU cores, and would still have a little less memory bandwidth than a three-year-old M2 Ultra.

I own both a Mac Studio and a Siena-based AMD system. There are valid reasons to go for x86, mainly its PCIe lanes: various accelerator cards, MCIO connectivity for NVMe, hardware IOMMU, SR-IOV networking and storage; in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed, this is why I used Siena for the comparison: it's at least comparable in terms of price, rather than some abstract idea of redline performance, where x86 CPUs by the way absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice isn't even whether you're getting a CPU; it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.

Update: I was unfair in my characterisation of the NVIDIA DGX Spark as a "big nothing"; despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always put a ConnectX-6 in a normal server's PCIe 5.0 slot, but that server would already set you back many thousands of euros at datacenter-grade specs.
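(A back-of-the-envelope sketch of the theoretical peaks being compared here, assuming the bus widths and transfer rates mentioned above; these are spec-sheet ceilings, not measured figures:)

    # Theoretical peak = bytes per transfer * transfers per second
    def peak_gb_per_s(bus_bits: int, megatransfers_per_s: float) -> float:
        return (bus_bits / 8) * megatransfers_per_s * 1e6 / 1e9

    print(peak_gb_per_s(12 * 64, 6400))  # 12-channel DDR5-6400 server: 614.4 GB/s
    print(peak_gb_per_s(1024, 6400))     # 1024-bit LPDDR5-6400 (M-series Ultra): 819.2 GB/s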
rowanG077 2 days ago
> Really, 1TB/s of memory bandwidth to and from system memory?

5x is false; it's more like 4x. Apple doesn't use memory sticks, they use DRAM ICs on the SoC package. The M3 Ultra has 8 memory channels at 128 bits per channel, for a 1024-bit memory bus in total. It uses LPDDR5-6400, so the peak is 1024 bits × 6400 MT/s = 6553.6 Gbit/s, i.e. 819.2 gigabytes per second of memory bandwidth.