| ▲ | PaulHoule 3 months ago |
| In AVX-512 we have a platform that rewards the assembly language programmer like few platforms have since the 6502. I see people doing really clever things that are specific to the system, and on one level it is really cool, but on another level it means SIMD is the domain of the specialist. Intel puts out press releases about the really great features they have for the national labs and for Facebook, whereas the rest of us are 5-10 years behind the curve for SIMD adoption because the juice isn't worth the squeeze. Just before libraries for training neural nets on GPUs became available, I worked on a product that had a SIMD-based neural network trainer written in hand-coded assembly. We were a generation behind in our AVX instructions, so we gave up half the performance we could have gotten, but that was the least of the challenges we had to overcome to get the product in front of customers. [1] My software-centric view of Intel's problems is that they've been spending their customers' and shareholders' money to put features in chips that are fused off, or might as well be fused off because they aren't widely supported in the industry. And they didn't see this as a problem, and neither did their enablers in the computing media and software industry. Just for example, Apple used to ship the MKL libraries, which were like a turbocharger for matrix math, back when they were using Intel chips. For whatever reason, Microsoft did not do this with Windows, and neither did most Linux distributions, so "the rest of us" are stuck with a fraction of the performance we paid for. AMD did the right thing in introducing double-pumped AVX-512 because at least assembly language wizards have some place where their code runs, and the industry gets closer to the point where we can count on using an instruction set defined 12 years ago. [1] If I'd been tasked with updating it to the next generation, I would have written a compiler (if I take that many derivatives by hand, I'll get one wrong). My boss would have ordered me not to; I would have done it anyway and not checked it in. |
|
| ▲ | ack_complete 3 months ago | parent | next [-] |
| AVX-512 also has a lot of wonderful facilities for autovectorization, but I suspect its initial downclocking effects plus getting yanked out of Alder Lake killed a lot of the momentum in improving compiler and library usage of it. Even the Steam Hardware Survey, which is skewed toward upper end hardware, only shows 16% availability of baseline AVX-512, compared to 94% for AVX2. |
| |
| ▲ | adgjlsfhk1 3 months ago | parent | next [-] | | It will be interesting to see what happens now that AMD is shipping good AVX-512. It really just makes Intel seem incompetent (especially since they're theoretically bringing AVX-512 back next year anyway) | | |
| ▲ | ack_complete 3 months ago | parent [-] | | No proof, but I suspect that AMD's AVX-512 support played a part in Intel dumping AVX10/256 and changing plans back to shipping a full 512-bit consumer implementation again (we'll see when they actually ship it). The downside is that AMD also increased the latency of all formerly cheap integer vector ops. This removes one of the main advantages against NEON, which historically has had richer operations but worse latencies. That's one thing I hope Intel doesn't follow. Also interesting is that Intel's E-core architecture is improving dramatically compared to the P-core, even surpassing it in some cases. For instance, Skymont finally has no penalty for denormals, a long-standing Intel weakness. It would not be surprising to see the E-core architecture take over at some point. | | |
| ▲ | adgjlsfhk1 3 months ago | parent [-] | | > For instance, Skymont finally has no penalty for denormals, a long-standing Intel weakness. Yeah, that's crazy to me. Intel has been so completely dysfunctional for the last 15 years. I feel like you couldn't have a clearer sign of "we have 2 completely separate teams that are competing with each other and aren't allowed to/don't want to talk to each other". It's just such a clear sign that the chicken is running around headless | | |
| ▲ | whizzter 3 months ago | parent [-] | | Not really; to me it seems more like Pentium 4 vs. Pentium-M/Core again. The downfall of the Pentium 4 was that they had been stuffing things into longer and longer pipes to keep up the frequency race (with horrible branch latencies as a result). They walked it all back by "resetting" to the P3/P-M/Core architecture and scaling up from that again. Pipes today are even _longer_, and if E-cores have shorter pipes at a similar frequency, then "regular" JS, Java, etc. code will be far more performant, even if you lose a bit of perf in the "performance" cases where people vectorize (did the HPC computing crowd steer Intel into a ditch AGAIN? wouldn't be surprising!). | | |
| ▲ | ack_complete 3 months ago | parent [-] | | Thankfully, the P-cores are nowhere near as bad as the Pentium 4 was. The Pentium 4 had such a skewed architecture that it was annoyingly frustrating to optimize for. Not only was the branch misprediction penalty long, but all the common methods of doing branchless logic, like conditional moves, were also slow. It also had a slow shifter, such that small left shifts were actually faster as sequences of adds, which I hadn't needed to do since the 68000 and 8086. And it had an annoying L1 cache with 64K aliasing penalties (guess which popular OS allocates all virtual memory, particularly thread stacks, at 64K alignment...). The P-cores have their warts, but are still much more well-rounded than the P4 was. |
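For illustration, a toy sketch of that adds-for-shifts trick (purely illustrative, not tuned P4 production code):

```cpp
#include <cstdint>

// On the Pentium 4, a small constant left shift could be slower than a
// chain of dependent adds, so x << 2 was sometimes rewritten as two adds.
uint32_t shl2_shift(uint32_t x) { return x << 2; }

uint32_t shl2_adds(uint32_t x) {
    x = x + x; // x << 1
    x = x + x; // x << 2
    return x;
}
```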
|
|
|
| |
| ▲ | ezekiel68 3 months ago | parent | prev [-] | | You mentioned "initial downclocking effects", yet (for posterity) I want to emphasize that on 2020's Ice Lake (Sunny Cove core) and later Intel processors, the downclocking is really a nothingburger. The fusing-off debacle you mentioned in desktop CPU families like Alder Lake definitely killed the momentum, though. I'm not sure why OS kernels couldn't have become partners in CPU capability queries (where a program starting execution could request a CPU core with feature 'X', such as AVX-512F, for example) -- but without that, the whole P-core/E-core hybrid concept was DOA for capabilities that were not least-common-denominator. If I had to guess, marketing got ahead of engineering and testing on that one. | |
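For context, the userspace-only status quo that comment contrasts with a kernel-assisted capability request is runtime dispatch. A minimal sketch using GCC/Clang's builtin (the function and fallback logic are illustrative):

```cpp
#include <cstddef>

// Userspace dispatch via CPUID (GCC/Clang builtin). On a hybrid part,
// per-core CPUID answers can differ, so this pattern is only safe when
// *every* core supports the feature -- exactly the P/E-core problem.
float sum(const float* x, size_t n) {
    float s = 0.0f;
    if (__builtin_cpu_supports("avx512f")) {
        // ... dispatch to an AVX-512 kernel here ...
        for (size_t i = 0; i < n; ++i) s += x[i];  // placeholder body
    } else {
        for (size_t i = 0; i < n; ++i) s += x[i];  // scalar fallback
    }
    return s;
}
```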
| ▲ | ack_complete 3 months ago | parent [-] | | Sure, but any core-wide downclocking effect at all is annoying for autovectorization, since a small local win easily turns into a global loss. That's why compilers have "prefer vector width" tuning parameters, so autovectorization can be tuned down to avoid 512-bit or even 256-bit ops. This is also the reason that having AVX-512 only on the P-cores wouldn't have worked, even with Thread Director support: it would only take one small routine in a common location to push most threads off the P-cores. I'm of the opinion that Intel's hybrid P/E architecture has been mostly useless anyway and only good for winning benchmarks. My current CPU has a 6P+4E configuration, and the scheduler hardly uses the E-cores at all unless forced; plus, performance was better and more stable with the E-cores disabled. |
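For reference, a sketch of that tuning knob as GCC and Clang expose it (the kernel below is just a generic streaming loop, not from any particular codebase):

```cpp
// Build with, e.g.:
//   g++ -O3 -march=skylake-avx512 -mprefer-vector-width=256 saxpy.cpp
// The -mprefer-vector-width=256 flag still permits AVX-512 features such
// as masking (via AVX-512VL), but caps autovectorized operations at 256
// bits to sidestep 512-bit downclocking.
#include <cstddef>

void saxpy(float a, const float* x, float* y, size_t n) {
    for (size_t i = 0; i < n; ++i)  // simple loop the compiler can vectorize
        y[i] += a * x[i];
}
```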
|
|
|
| ▲ | the__alchemist 3 months ago | parent | prev | next [-] |
| Noob question! What about AVX-512 makes it unique to assembly programmers? I'm just dipping my toes in, and have been doing some chemistry computations using f32x8, Vec3x8, etc. (256-bit AVX). I have good workflows set up, but have only been getting a 2x speedup over non-SIMD code. (Was hoping for closer to 8x.) I figured AVX-512 would allow f32x16 etc., which would be mostly a drop-in replacement. (I have macros to set up the types, and you input the number of lanes.) |
| |
| ▲ | ack_complete 3 months ago | parent | next [-] | | AVX-512 has a lot of instructions that just extend vectorization to 512-bit and make it nicer with features like masking. Thus, a very valid use of it is just to double vectorization width. But it also has a bunch of specialized instructions that can boost performance beyond just the 2x width. One of them is VPCOMPRESSB, which accelerates compact encoding of sparse data. Another is GF2P8AFFINEQB, which is targeted at specific encryption algorithms but can also be abused for general bit shuffling. Algorithms like computing a histogram can benefit significantly, but it requires reshaping the algorithm around very particular and peculiar intermediate data layouts that are beyond the transformations a compiler can do. This doesn't literally require assembly language, though; it can often be done with intrinsics. | |
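A minimal sketch of the masking feature mentioned above, assuming an AVX-512F target (intrinsics only, no assembly): one loop covers the whole array, including the tail, because masked-off lanes are simply inactive.

```cpp
#include <immintrin.h>
#include <cstddef>

// Elementwise add with no scalar remainder loop: the final partial
// iteration runs under a mask instead.
void add_arrays(float* dst, const float* a, const float* b, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        size_t rem = n - i;
        // One bit per active lane; a partial tail gets a partial mask.
        __mmask16 m = (rem >= 16) ? (__mmask16)0xFFFF
                                  : (__mmask16)((1u << rem) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
        _mm512_mask_storeu_ps(dst + i, m, _mm512_add_ps(va, vb));
    }
}
```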
| ▲ | dzaima 3 months ago | parent | prev [-] | | SIMD only helps where you're arithmetic-limited; you may instead be limited by memory bandwidth, or perhaps by float division if applicable; and if your scalar baseline got autovectorized by the compiler, you'd see roughly no benefit from explicit SIMD. AVX-512 should be just fine via intrinsics/high-level vector types; it's no different from AVX2 in this regard. |
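To make the bandwidth point concrete, a back-of-envelope sketch (the kernel, traffic, and bandwidth numbers are all illustrative assumptions, not measurements):

```cpp
#include <cstdio>

// Roofline-style estimate for a streaming kernel like y[i] += a*x[i]:
// 2 flops per element, but ~12 bytes of memory traffic per element
// (load x, load y, store y). Memory bandwidth, not SIMD width, sets
// the ceiling.
int main() {
    double bytes_per_elem = 12.0;  // 4 B load x + 4 B load y + 4 B store y
    double flops_per_elem = 2.0;   // one mul + one add
    double mem_bw = 50e9;          // assumed ~50 GB/s DRAM bandwidth
    double ceiling = mem_bw / bytes_per_elem * flops_per_elem;
    printf("bandwidth-bound ceiling: %.1f GFLOP/s\n", ceiling / 1e9);
    // If scalar code already gets within a factor of ~2 of this ceiling,
    // 8- or 16-wide SIMD cannot deliver 8-16x -- consistent with the ~2x
    // the earlier comment observed.
    return 0;
}
```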
|
|
| ▲ | bee_rider 3 months ago | parent | prev [-] |
| It is kind of a bummer that MKL isn't open sourced, as that would make inclusion in Linux distributions easier. It is already free-as-in-beer, but of course that doesn't solve everything. Baffling that MS didn't use it; they have a pretty close relationship… Agree that they are sort of going after hard-to-use niche features nowadays. But I think it is just that the real thing we want, single-threaded performance for branchy code, is, like, incredibly difficult to improve nowadays. |
| |
| ▲ | PaulHoule 3 months ago | parent [-] | | At the very least you can decode UTF-8 really quickly with AVX-512 https://lemire.me/blog/2023/08/12/transcoding-utf-8-strings-... and web browsers spend a lot of cycles decoding HTML and JavaScript, which is UTF-8 encoded. It turns out AVX-512 is good at a lot of things you wouldn't think SIMD would be good at. Intel's got the problem that people don't want to buy new computers because they don't see much benefit from buying one, but a new computer doesn't have the benefit it could have because of lagging software support, and the software support lags because there aren't enough new computers to justify the work. Intel deserves blame for a few things, one of which is that they have dragged their feet at getting really innovative features into their products while turning people off with various empty slogans. They really do have a new instruction set that targets plain ordinary single-threaded branchy code https://www.intel.com/content/www/us/en/developer/articles/t... but they'll probably be out of business before you can use it. | | |
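One concrete building block of such SIMD text fast paths, as a sketch (assumes an AVX-512BW target; this is a common idiom, not the linked article's exact code): a 64-byte chunk is pure ASCII exactly when no byte has its high bit set, which one mask instruction can test.

```cpp
#include <immintrin.h>
#include <cstdint>

// Returns true if the 64 bytes at p are all ASCII (< 0x80), i.e. the
// cheap memcpy-style path applies and full UTF-8 handling can be skipped.
bool is_ascii_chunk(const uint8_t* p) {
    __m512i v = _mm512_loadu_si512(p);   // load 64 bytes
    return _mm512_movepi8_mask(v) == 0;  // any set sign bit => byte >= 0x80
}
```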
| ▲ | immibis 3 months ago | parent | next [-] | | If you pay attention, this isn't a UTF-8 decoder. It might be some other encoding, or a complete misunderstanding of how UTF-8 works, or an AI hallucination. It also doesn't talk about how to handle the variable number of output bytes, or the possibility of a continuation sequence split between input chunks. | | |
| ▲ | kjs3 3 months ago | parent [-] | | I paid attention, and I don't see where Daniel claimed that this is a complete UTF-8 decoder. He's illustrating a programming technique using a simplified use case, not solving the world's problems. And I don't think Daniel Lemire lacks an understanding of the concept or needs an AI to code it. | | |
| ▲ | magicalhippo 3 months ago | parent [-] | | Agreed, but the points raised by the GP are valid insofar as that article is used as an argument that AVX-512 can decode UTF-8 well. It might be fast, but it's not a UTF-8 decoder. It's a transcoder to a fixed, and very limited, target encoding. | | |
| ▲ | kjs3 2 months ago | parent [-] | | I thought it was pretty clear the GP was talking about Daniel's article, not the blog post, but I guess I can see two readings. |
|
|
| |
| ▲ | gatane 3 months ago | parent | prev [-] | | In the end, it doesn't even matter: JavaScript frameworks are already big enough to slow down your PC, unless said parsing optimization runs at the very core of JS. | | |
|
|