ack_complete 5 months ago

AVX-512 also has a lot of wonderful facilities for autovectorization, but I suspect its initial downclocking effects plus getting yanked out of Alder Lake killed a lot of the momentum in improving compiler and library usage of it.

Even the Steam Hardware Survey, which is skewed toward upper-end hardware, shows only 16% availability of baseline AVX-512, compared to 94% for AVX2.
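
(For context, that gap is why AVX-512 paths still have to sit behind runtime dispatch rather than be assumed as a baseline. A minimal sketch with the GCC/Clang builtins -- the printed strings just stand in for real kernels:)

    #include <stdio.h>

    int main(void) {
        /* GCC/Clang builtin CPU feature detection; __builtin_cpu_init()
           populates the cached CPUID data the supports() checks read. */
        __builtin_cpu_init();

        if (__builtin_cpu_supports("avx512f"))
            puts("dispatch: AVX-512 kernel");
        else if (__builtin_cpu_supports("avx2"))
            puts("dispatch: AVX2 kernel");
        else
            puts("dispatch: scalar fallback");
        return 0;
    }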

adgjlsfhk1 5 months ago | parent | next [-]

It will be interesting to see what happens now that AMD is shipping good AVX-512. It really just makes Intel seem incompetent (especially since they're theoretically bringing AVX-512 back next year anyway).

ack_complete 5 months ago | parent [-]

No proof, but I suspect that AMD's AVX-512 support played a part in Intel dumping AVX10/256 and changing plans back to shipping a full 512-bit consumer implementation again (we'll see when they actually ship it).

The downside is that AMD also increased the latency of all the formerly cheap integer vector ops. This removes one of the main advantages over NEON, which historically has had richer operations but worse latencies. That's one thing I hope Intel doesn't follow.

Also interesting is that Intel's E-core architecture is improving dramatically compared to the P-core, even surpassing it in some cases. For instance, Skymont finally has no penalty for denormals, a long-standing Intel weakness. Would not be surprising to see the E-core architecture take over at some point.
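
(For anyone who hasn't run into it: the usual workaround on older Intel cores is to set the FTZ/DAZ bits in MXCSR so denormals stop triggering microcode assists. A minimal sketch -- note this trades away strict IEEE 754 semantics:)

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ) */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ) */
    #include <stdio.h>

    int main(void) {
        /* FTZ: denormal results are flushed to zero.
           DAZ: denormal inputs are treated as zero.
           Both avoid the assist penalty older Intel cores take on denormals. */
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        volatile float tiny = 1e-40f;        /* denormal in single precision */
        volatile float half = tiny * 0.5f;   /* treated as 0 * 0.5f with DAZ on */
        printf("%g\n", (double)half);        /* prints 0 */
        return 0;
    }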

adgjlsfhk1 5 months ago | parent [-]

> For instance, Skymont finally has no penalty for denormals, a long-standing Intel weakness.

Yeah, that's crazy to me. Intel has been so completely dysfunctional for the last 15 years. I feel like you couldn't have a clearer sign of "we have two completely separate teams that are competing with each other and aren't allowed to/don't want to talk to each other." It's just such a clear sign that the company is running around like a headless chicken.

whizzter 5 months ago | parent [-]

Not really; to me it seems more like Pentium 4 vs. Pentium M/Core again.

The downfall of the Pentium 4 was that they had been stuffing things into longer and longer pipes to keep up in the frequency race (with horrible branch-misprediction latencies as a result). They walked it all back by "resetting" to the P3/Pentium M/Core architecture and scaling up from there again.

Pipes today are even _longer_, and if the E-cores have shorter pipes at a similar frequency, then "regular" JS, Java, etc. code will be far more performant, even if you lose a bit of perf in the "performance" cases where people vectorize. (Did the HPC crowd steer Intel into a ditch AGAIN? Wouldn't be surprising!)

ack_complete 5 months ago | parent [-]

Thankfully, the P-cores are nowhere near as bad as the Pentium 4 was. The Pentium 4 had such a skewed architecture that it was frustrating to optimize for. Not only was the branch misprediction penalty long, but all the common methods of doing branchless logic, like conditional moves, were also slow. It also had a slow shifter, such that small left shifts were actually faster as sequences of adds, which I hadn't needed to do since the 68000 and 8086. And it had an annoying L1 cache with 64K aliasing penalties (guess which popular OS allocates all virtual memory, particularly thread stacks, at 64K alignment...).
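
(To give a flavor of what that meant in practice, here's roughly the kind of thing P4-targeted code ended up doing -- purely illustrative, not something you'd write for a modern core:)

    #include <stdint.h>
    #include <stdio.h>

    /* The P4's shifter was slow enough that a small constant left shift
       could be cheaper as a chain of adds. */
    static uint32_t times4(uint32_t x) {
        uint32_t t = x + x;   /* x << 1 */
        return t + t;         /* x << 2 */
    }

    /* cmov was slow on the P4 too; one alternative was a comparison-mask
       select, though that wasn't always a win either. */
    static uint32_t select_nz(uint32_t cond, uint32_t a, uint32_t b) {
        uint32_t mask = 0u - (cond != 0);   /* all-ones if cond, else 0 */
        return (a & mask) | (b & ~mask);
    }

    int main(void) {
        printf("%u %u\n", times4(10), select_nz(1, 7, 9));  /* 40 7 */
        return 0;
    }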

The P-cores have their warts, but are still much more well-rounded than the P4 was.

ezekiel68 5 months ago | parent | prev [-]

You mentioned "initial downclocking effects", but (for posterity) I want to emphasize that on Ice Lake (2020, Sunny Cove cores) and later Intel processors, the downclocking is really a nothingburger. The fusing-off debacle in desktop CPU families like Alder Lake that you mentioned definitely killed the momentum, though.

I'm not sure why OS kernels couldn't have become partners in CPU capability queries (where a program starting execution could request a CPU core with capability 'X', such as AVX-512F, for example) -- but without that, the whole P-core/E-core hybrid concept was DOA for any capability that wasn't a least-common denominator. If I had to guess, marketing got ahead of engineering and testing on that one.
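
(Absent that kind of kernel support, the only way for userspace to cope was to probe each core itself. A rough Linux-only sketch of that dance, using GCC/Clang's cpuid.h helper and skipping the XGETBV/OS-enable check for brevity:)

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_setaffinity, cpu_set_t */
    #include <unistd.h>     /* sysconf */
    #include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
    #include <stdio.h>

    /* AVX-512F is reported in CPUID leaf 7, subleaf 0, EBX bit 16. */
    static int this_core_has_avx512f(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx >> 16) & 1;
    }

    int main(void) {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        for (long i = 0; i < ncpu; i++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)i, &set);
            /* Pin to core i so CPUID reports that core's features. */
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                continue;
            printf("cpu %ld: AVX-512F %s\n", i,
                   this_core_has_avx512f() ? "yes" : "no");
        }
        return 0;
    }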

ack_complete 5 months ago | parent [-]

Sure, but any core-wide downclocking effect at all is annoying for autovectorization, since a small local win easily turns into a global loss. That's why compilers have "prefer vector width" tuning parameters, so autovectorization can be dialed down to avoid 512-bit or even 256-bit ops.
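
(Concretely, that's the -mprefer-vector-width knob in GCC/Clang -- a loop like the one below can be built with AVX-512 enabled but the autovectorizer capped at 256-bit vectors; the exact default width depends on the -mtune target:)

    /* Build with, e.g.:
     *   cc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c saxpy.c
     * AVX-512 instructions stay available, but the autovectorizer prefers
     * 256-bit (ymm) operations over 512-bit (zmm) ones. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }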

This is also the same reason that having AVX-512 only on the P-cores wouldn't have worked, even with thread director support. It would only take one small routine in a common location to push most threads off the P-cores.

I'm of the opinion that Intel's hybrid P/E-core architecture has been mostly useless anyway and only good for winning benchmarks. My current CPU has a 6P+4E configuration, and the scheduler hardly uses the E-cores at all unless forced; plus, performance was better and more stable with the E-cores disabled.