ashvardanian 4 days ago

> All vector units have full 512 bits capabilities except for memory writes. A 512-bit vector write instruction is executed as two 256-bit writes.

That sounds like a weird design choice. Curious if this will affect memcpy-heavy workloads.

Writes aside, Zen5 is taking much longer to roll out than I thought, and some of AMD's positioning is (almost expectedly) misleading, especially around AI.

AMD's website claims Zen5 is the "Leading CPU for AI" (<https://www.amd.com/en/products/processors/server/epyc/ai.ht...>), but I strongly doubt that. First, they compare Zen5 (9965), which is still largely unavailable, to Xeon2 (8280), a processor two generations older. Xeon4 is abundantly available and comes with AMX, a feature exclusive to Intel. I doubt AVX-512 support with a 512-bit physical path, and even twice as many cores, will be enough to compete with that (if we consider just the ALU throughput rather than the overall system & memory).

dragontamer 4 days ago | parent | next [-]

Well, when you consider that AVX 512 instructions have 2 or 3 reads per 1 write, there's a degree of sense here.

Consider the standard matrix-multiplication primitive, the FMA (fused multiply-accumulate): Output = A * B + C, which is three reads and one write, if I'm counting correctly.
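
In plain Python, the operand traffic of one FMA looks like this (a scalar sketch, not the vector instruction itself; the counters just make the 3:1 ratio explicit):

```python
def fma(a, b, c):
    """Scalar stand-in for the vector FMA: out = a * b + c.
    Each element needs three reads (a, b, c) and one write (out)."""
    reads, writes = 0, 0
    out = [0.0] * len(a)
    for i in range(len(a)):
        x, y, z = a[i], b[i], c[i]  # three reads per element
        reads += 3
        out[i] = x * y + z          # one write per element
        writes += 1
    return out, reads, writes

out, reads, writes = fma([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(out, reads, writes)  # [8.0, 14.0] 6 2, i.e. a 3:1 read-to-write ratio
```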

ryao 4 days ago | parent | prev | next [-]

AMD CPUs tend to have more memory bandwidth than Intel CPUs and inference is CPU bound, so their claim seems accurate to me.

Whether the core does a 512-bit write in 1 cycle or 2 because it is two 256-bit writes is immaterial. Memory bandwidth is bottlenecked at 64 GB/s per CCX. You need to use cores from multiple CCXs to get full bandwidth.

That said, the EPYC 9175F has 614.4 GB/s of memory bandwidth and should be able to use all of it. I have one, although the machine is not yet assembled (Supermicro took 7 weeks to send me a motherboard, which delayed assembly), so I have not confirmed that it can use all of it yet.
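
For what it's worth, the 614.4 GB/s figure falls out of the platform's 12 DDR5 channels (assuming DDR5-6400; each channel moves 8 bytes per transfer):

```python
channels = 12               # SP5 platform: 12 DDR5 channels
transfers_per_sec = 6400e6  # DDR5-6400 (assumed), in transfers/s
bytes_per_transfer = 8      # 64-bit channel = 8 bytes per transfer
bandwidth = channels * transfers_per_sec * bytes_per_transfer
print(bandwidth / 1e9)  # 614.4 (GB/s)
```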

ryao 4 days ago | parent | next [-]

> inference is CPU bound

This was a typo. It should have been “inference is memory bandwidth bound”.

menaerus 4 days ago | parent | prev | next [-]

Interesting design. 16 CCDs / 16 CCXs / 16 cores. 1 core per each CCD. 1 CCX per each CCD. With 512MB of L3 cache this CPU should be able to use ~all of its ~10 TB/s of L3 MBW out of the box.

How much is it going to cost you to build the box?

adgjlsfhk1 4 days ago | parent | prev [-]

You can use higher write bandwidth than the CCX bandwidth by having multiple writes go to the same L2 address before going out to RAM.

rpiguy 4 days ago | parent | prev | next [-]

It may be easier for the memory controller to schedule two narrower writes than to wait for one 512-bit block. Or perhaps they just didn't substantially update the memory controller, so it still has to operate as it did in Zen 4.

p_l 2 days ago | parent [-]

Zen 4 memory controllers preferably operate in multiples of 512 bits (a single burst on a 16n-prefetch-mode DDR5 channel; there are 4 such channels on consumer Zen 4 devices).

vient 4 days ago | parent | prev | next [-]

AMX is indeed a very strong feature for AI. I've compared the Ryzen 9950X with the w7-2495X using single-threaded inference of some fp32/bf16 neural networks, and while Zen 5 is clearly better than Zen 4, the Xeon is still a lot faster, even considering that its frequency is almost 1 GHz lower.

Now, if we say "Zen5 is the leading consumer CPU for AI" then no objections can be made: consumer Intel models do not even support AVX-512.

Also, note that for inference they compare with the Xeon 8592+, which is the top Emerald Rapids model. Not sure if a comparison with Granite Rapids would have been more appropriate, but they surely dodged the AMX bullet by testing FP32 precision instead of BF16.

reitzensteinm 4 days ago | parent | prev | next [-]

This is a misreading of their website. On the left, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8280 (launched Q2 '19) and make a TCO argument for replacing outdated Intel servers with AMD.

On the right, they compare the EPYC 9965 (launched 10/10/24) with the Xeon Platinum 8592+ (launched Q4 '23), a like-for-like comparison against Intel's competition at launch.

The argument is essentially in two pieces - "If you're upgrading, you should pick AMD. If you're not upgrading, you should be."

ashvardanian 4 days ago | parent [-]

It’s true that they compare to different Intel CPUs in different parts of the webpage, and I don’t always understand the intentions behind those comparisons.

Still, if you decode the unreadable footnotes 2 & 3 at the bottom of the page, a few things stand out: avoiding AMX, using CPUs with different core counts & costs, and even running on a different Linux kernel version, which may affect scheduling…

bcrl 3 days ago | parent | prev | next [-]

It's probably a design choice that is driven by power consumption. 512 bit writes are probably used rarely enough that the performance benefits do not outweigh the additional power consumption that would be borne by all memory writes.

arrakark 4 days ago | parent | prev [-]

Cache-line bursts/beats tend to be standardized to 64B in lots of NoC architectures.

p_l 2 days ago | parent | next [-]

The 64-byte cache line size matches a single 64-byte burst transaction on DDR3 through DDR5, and a ganged dual-channel transaction on DDR2. Matching those together gives you a nice 1-to-1 relationship between filling a cache line and a single fast memory transaction.
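
The DDR5 case in numbers (a sketch: BL16 on one 32-bit subchannel):

```python
burst_length = 16     # DDR5 BL16, i.e. the "16n prefetch"
subchannel_bits = 32  # each DDR5 channel is split into 32-bit subchannels
burst_bytes = burst_length * subchannel_bits // 8
print(burst_bytes)  # 64: exactly one x86 cache line (and one AVX-512 register)
```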

Dylan16807 4 days ago | parent | prev | next [-]

"Network on Chip" okay got it.

crest 4 days ago | parent | prev [-]

A 64B cache-line is the same size as an AVX-512 register.