▲ | majke 7 months ago |
This has puzzled me for a while. The cited system has 2x89.6 GB/s bandwidth. But a single CCD can do at most 64GB/s of sequential reads. Are claims like "Apple Silicon having 400GB/s" meaningless? I understand a typical single logical CPU can't do more than 50-70GB/s, and it seems like a group of CPUs typically shares a memory controller which is similarly limited. To rephrase: is it possible to cause 100% memory bandwidth utilization with only 1 or 2 CPUs doing the work per CCD?
▲ | ryao 7 months ago | parent | next [-]
On Zen 3, I am able to use nearly the full 51.2GB/sec from a single CPU core. I have not tried using two, as I got so close to 51.2GB/sec that I assumed going higher was not possible. Off the top of my head, I got 49-50GB/sec, but I last measured a couple of years ago. By the way, if the cores were able to load things at full speed, they would be able to use 640GB/sec each: that is 2 AVX-512 loads per cycle at 5GHz. Of course, they are never able to do this due to memory bottlenecks. Maybe Intel's Xeon Max series with HBM can, but I would not be surprised to see an unadvertised internal bottleneck there too. That said, it is so expensive and rare that few people will ever run code on one.
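The per-core arithmetic above is easy to sanity-check. A sketch (the 2 loads/cycle and 5GHz figures are the ones quoted in the comment, not measured values):

```python
# Theoretical peak load bandwidth of one AVX-512-capable core,
# using the figures quoted above.
loads_per_cycle = 2        # two 512-bit load ports
bytes_per_load = 64        # 512 bits = 64 bytes
clock_hz = 5.0e9           # 5 GHz, as assumed in the comment

peak_bytes_per_sec = loads_per_cycle * bytes_per_load * clock_hz
print(peak_bytes_per_sec / 1e9, "GB/s")   # 640.0 GB/s per core

# Compare with what dual-channel DDR4-3200 can actually supply:
ddr4_3200_dual = 3200e6 * 8 * 2 / 1e9     # MT/s * 8 bytes/channel * 2 channels
print(ddr4_3200_dual, "GB/s")             # 51.2 GB/s
```

So a single core's load ports outrun the DRAM by more than an order of magnitude, which is why the measured ~50GB/sec is a memory-side limit, not a core-side one.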
▲ | jeffbee 7 months ago | parent | prev | next [-]
There are large differences in load/store performance across implementations. On Apple Silicon, for example the M1 Max, a single core can stream about 100GB/s all by itself. This is a significant advantage over competing designs that are built to hit that kind of memory bandwidth only with all-cores workloads. For example, five generations of Intel Xeon processors, from Sandy Bridge through Skylake, were built to achieve about 20GB/s streams from a single core. That is one reason why the M1 was so exceptional at the time it was released: the 1T memory performance is much better than what you get from everyone else. As for claims of the M1 Max having > 400GB/s of memory bandwidth, this isn't achievable from CPUs alone. You need all CPUs and GPUs running full tilt to hit that limit. In practice you can hit maybe 250GB/s from CPUs if you bring them all to bear, including the efficiency cores. This is still extremely good performance.
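A rough way to see this kind of single-thread number for yourself, with nothing but the standard library: time a large buffer copy. This is a sketch, not a rigorous STREAM benchmark — it measures copy traffic (one read plus one write of the buffer), and the buffer size is an assumption chosen to be much larger than any last-level cache:

```python
import time

def stream_copy_bandwidth(size_mb=256, reps=5):
    """Rough single-thread copy-bandwidth estimate.

    bytes(bytearray) is a single memcpy in CPython, so interpreter
    overhead is negligible for buffers this large.
    """
    buf = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(buf)            # full read + full write of the buffer
        best = min(best, time.perf_counter() - t0)
    # Count both read and write traffic, report GB/s.
    return 2 * len(buf) / best / 1e9

if __name__ == "__main__":
    print(f"~{stream_copy_bandwidth():.1f} GB/s (single thread, copy)")
```

The absolute number will sit below the platform's theoretical peak for all the reasons discussed in this thread, but it makes the 1T gap between designs easy to observe.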
▲ | KeplerBoy 7 months ago | parent | prev | next [-]
Aren't those 400 GB/s a figure which only applies when the GPU, with its much wider interface, is accessing the memory?
▲ | jmb99 7 months ago | parent | prev | next [-]
> The cited system has 2x89.6 GB/s bandwidth.

The following applies for certain only to the Zen4 system; I have no experience with Zen5.

That is the theoretical max bandwidth of the DDR5 memory (/controller) running at 5600MT/s (roughly: 5600MT/s × 16 bytes per transfer across two 64-bit channels = 89.6GB/s). There is also a bandwidth limitation between the memory controller (IO die) and the cores themselves (CCDs), along the Infinity Fabric. Infinity Fabric runs at a different clock speed than the cores, their cache(s), and the memory controller; by default, 2/3 of the memory controller's clock. So, if the Memory controller CLocK (MCLK) is 2800MHz (for 5600MT/s), the Infinity Fabric CLocK (FCLK) will run at 1866.66MHz. With 32 bytes of read bandwidth per clock, you get 59.7GB/s maximum sequential memory read bandwidth per CCD<->IOD interconnect.

Many systems (read: motherboard manufacturers) will overclock the FCLK when applying automatic overclocking profiles (such as XMP), and I believe some EXPO profiles include overclocking the FCLK as well. (Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s, and most memory kits are 3600MT/s or less until overclocked with their built-in profiles.) In my experience, Zen4 will happily accept FCLK up to 2000MHz, while Zen4 Threadripper (7000 series) seems happy up to 2200MHz. This particular system has the FCLK overclocked to 2000MHz, which will hurt latency[0] (due to not being 2/3 of MCLK) but increase bandwidth: 2000MHz × 32 bytes/cycle = 64GB/s read bandwidth, as quoted in the article.

First: these are theoretical maximums. Even the most "perfect" benchmark won't hit them, and if it does, there are other variables at play not being taken into account (likely lower-level caches). You will never, ever see theoretical maximum memory bandwidth in any real application.
Second: no, it is not possible to see maximum memory bandwidth on Zen4 from only one CCD, assuming you have sufficiently fast DDR5 that the FCLK cannot be equal to the MCLK. This is an architectural limitation, although rarely hit in practice by most of the target market. A dual-CCD chip has sufficient Infinity Fabric bandwidth to saturate the memory before the fabric itself becomes the limit (but, as alluded to in the article, unless tuned incredibly well you'll likely run into contention issues and hit either a latency or a bandwidth wall in real applications). My quad-CCD Threadripper can achieve nearly 300GB/s, due to having 8 (technically 16) DDR5 channels operating at 5800MT/s and FCLK at 2200MHz; I would need an octo-CCD chip to achieve maximum memory bandwidth utilization.

Third: no, claims like "Apple Silicon having 400GB/s" are not meaningless. Those numbers are derived the exact same way as above, and the same way Nvidia determines the maximum memory bandwidth of their GPUs. Platform differences (especially CPU vs GPU, but even CPU vs CPU, since Apple, AMD, and Intel all have very different topologies) make the numbers incomparable to each other directly. As an example, Apple Silicon can probably achieve higher per-core memory bandwidth than Zen4 (or 5), but it also shares bandwidth with the GPU. This may not be great for gaming, where memory bandwidth requirements are high for both the CPU and GPU, but may be fine for ML inference, where the CPU sits mostly idle while the GPU does most of the work.

[0] I'm surprised the author didn't mention this. I can only assume they didn't know about it and haven't tested other frequencies or read much on the overclocking forums about Zen4. Which is fair enough; it's a very complicated topic with a lot of hidden nuances.
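The ceilings in this comment all come from the same multiply-and-compare arithmetic, so they can be checked in a few lines. A sketch using only the figures quoted above (the 16-byte and 8-byte bus widths are the standard DDR5 dual-channel/per-channel values):

```python
# DDR5-5600, two 64-bit channels -> DRAM-side ceiling:
dram_gbs = 5600e6 * 16 / 1e9            # 89.6 GB/s

# Infinity Fabric, 32 bytes/cycle read, FCLK = 2/3 of MCLK by default:
fclk_stock_hz = 2800e6 * 2 / 3          # ~1866.67 MHz
if_stock_gbs = fclk_stock_hz * 32 / 1e9 # ~59.7 GB/s per CCD
if_oc_gbs = 2000e6 * 32 / 1e9           # FCLK overclocked to 2 GHz -> 64 GB/s

# Quad-CCD Threadripper example: the achievable ceiling is the
# smaller of the DRAM side and the fabric side.
dram_tr_gbs = 8 * 5800e6 * 8 / 1e9      # 8 channels of DDR5-5800 -> 371.2 GB/s
fabric_tr_gbs = 4 * 2200e6 * 32 / 1e9   # 4 CCDs at FCLK 2.2 GHz -> 281.6 GB/s
ceiling_gbs = min(dram_tr_gbs, fabric_tr_gbs)  # fabric-limited, ~281.6 GB/s
```

The Threadripper case makes the "octo-CCD" remark concrete: the fabric side (281.6GB/s) sits below the DRAM side (371.2GB/s), so it takes more CCDs, not faster RAM, to close the gap.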
▲ | neonsunset 7 months ago | parent | prev [-]
Easily; the memory subsystem on AMD's consumer parts is embarrassingly weak (as on desktop and portable consumer devices in general, save for Apple's and select bespoke designs).