| ▲ | PaulHoule a day ago |
| In AVX-512 we have a platform that rewards the assembly language programmer like few platforms have since the 6502. I see people doing really clever things that are specific to the system, and on one level it is really cool, but on another level it means SIMD is the domain of the specialist. Intel puts out press releases about the really great features they have for the national labs and for Facebook, whereas the rest of us are 5-10 years behind the curve on SIMD adoption because the juice isn't worth the squeeze.

Just before libraries for training neural nets on GPUs became available, I worked on a product that had a SIMD-based neural network trainer written in hand-coded assembly. We were a generation behind in our AVX instructions, so we gave up half of the performance we could have gotten, but that was the least of the challenges we had to overcome to get the product in front of customers. [1]

My software-centric view of Intel's problems is that they've been spending their customers' and shareholders' money to put features in chips that are fused off, or might as well be fused off because they aren't widely supported in the industry. And they didn't see this as a problem, and neither did their enablers in the computing media and software industry. Just for example, Apple used to ship the MKL libraries, which were like a turbocharger for matrix math, back when they were using Intel chips. For whatever reason, Microsoft did not do this with Windows, and neither did most Linux distributions, so "the rest of us" are stuck with a fraction of the performance we paid for.

AMD did the right thing in introducing double-pumped AVX-512, because at least assembly language wizards have some place where their code runs, and the industry gets closer to the point where we can count on using an instruction set defined 12 years ago.

[1] If I'd been tasked with updating it to the next generation, I would have written a compiler (if I take that many derivatives by hand, I'll get one wrong). My boss would have ordered me not to; I would have done it anyway and not checked it in. |
|
| ▲ | the__alchemist 3 hours ago | parent | next [-] |
| Noob question! What about AVX-512 makes it unique to assembly programmers? I'm just dipping my toes in, and have been doing some chemistry computations using f32x8, Vec3x8, etc. (256-bit AVX). I have good workflows set up, but have only been getting a 2x speedup over non-SIMD. (I was hoping for closer to 8.) I figured AVX-512 would allow f32x16 etc., which would be mostly a drop-in. (I have macros to set up the types, and you input the number of lanes.) |
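For what it's worth, here is a minimal sketch of the lane-width change being asked about, assuming Rust nightly's portable SIMD (the portable_simd feature) rather than the macro setup described above; axpy8 and axpy16 are made-up names, and whether the 16-lane version actually lands in 512-bit registers depends on compiling with something like -C target-feature=+avx512f:

    #![feature(portable_simd)]
    use std::simd::{f32x8, f32x16};

    // 8-lane version: 256-bit vectors, i.e. AVX/AVX2-sized registers.
    fn axpy8(a: f32, x: f32x8, y: f32x8) -> f32x8 {
        f32x8::splat(a) * x + y
    }

    // 16-lane version: 512-bit vectors. The body is unchanged apart from
    // the lane count, which is the sense in which AVX-512 can be "mostly
    // a drop-in" at the source level.
    fn axpy16(a: f32, x: f32x16, y: f32x16) -> f32x16 {
        f32x16::splat(a) * x + y
    }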
|
| ▲ | ack_complete 15 hours ago | parent | prev | next [-] |
| AVX-512 also has a lot of wonderful facilities for autovectorization, but I suspect its initial downclocking effects plus getting yanked out of Alder Lake killed a lot of the momentum in improving compiler and library usage of it. Even the Steam Hardware Survey, which is skewed toward upper end hardware, only shows 16% availability of baseline AVX-512, compared to 94% for AVX2. |
| |
| ▲ | adgjlsfhk1 15 hours ago | parent | next [-] | | It will be interesting to see what happens now that AMD is shipping good AVX-512. It really just makes Intel seem incompetent (especially since they're theoretically bringing AVX-512 back next year anyway). | | |
| ▲ | ack_complete 14 hours ago | parent [-] | | No proof, but I suspect that AMD's AVX-512 support played a part in Intel dumping AVX10/256 and changing plans back to shipping a full 512-bit consumer implementation again (we'll see when they actually ship it). The downside is that AMD also increased the latency of all formerly cheap integer vector ops. This removes one of the main advantages against NEON, which historically has had richer operations but worse latencies. That's one thing I hope Intel doesn't follow. Also interesting is that Intel's E-core architecture is improving dramatically compared to the P-core, even surpassing it in some cases. For instance, Skymont finally has no penalty for denormals, a long standing Intel weakness. Would not be surprising to see the E-core architecture take over at some point. | | |
| ▲ | adgjlsfhk1 10 hours ago | parent [-] | | > For instance, Skymont finally has no penalty for denormals, a long standing Intel weakness. Yeah, that's crazy to me. Intel has been so completely dysfunctional for the last 15 years. I feel like you couldn't have a clearer sign of "we have 2 completely separate teams that are competing with each other and aren't allowed to/don't want to talk to each other". It's just such a clear sign that the chicken is running around headless. | | |
| ▲ | whizzter 6 hours ago | parent [-] | | Not really; to me it seems more like Pentium 4 vs Pentium-M/Core again. The downfall of the Pentium 4 was that they had been stuffing things into longer and longer pipes to keep up in the frequency race (with horrible branch latencies as a result). They walked it all back by "resetting" to the P3/P-M/Core architecture and scaling up from that again. Pipes today are even _longer_, and if the E-cores have shorter pipes at a similar frequency then "regular" JS, Java, etc. code will be far more performant, even if you lose a bit of perf in the "performance" cases where people vectorize. (Did the HPC crowd steer Intel into a ditch AGAIN? Wouldn't be surprising!) |
|
|
| |
| ▲ | ezekiel68 6 hours ago | parent | prev [-] | | You mentioned "initial downclocking effects", yet (for posterity) I want to emphasize that on Ice Lake (2020, Sunny Cove core) and later Intel processors, the downclocking is really a nothingburger. The fusing-off debacle in desktop CPU families like the Alder Lake you mentioned definitely killed the momentum, though. I'm not sure why OS kernels couldn't have become partners in CPU capability queries (where a program starting execution could request a CPU core with 'X', such as AVX-512F, for example) -- but without that, the whole P-core/E-core hybrid concept was DOA for any capability that wasn't part of the least common denominator. If I had to guess, marketing got ahead of engineering and testing on that one. |
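For context, what user space can do today is a one-shot CPUID-style query at startup and then dispatch on the answer; a minimal Rust sketch follows (describe_simd_support is a made-up name). The catch on hybrid parts is that the answer has to hold for every core the thread might later be migrated to, which is why the capability ended up being reported as absent (and eventually fused off) rather than negotiated with the scheduler.

    // Minimal sketch of user-space capability detection as it exists now:
    // ask once, pick a code path, and assume the answer holds on every
    // core the OS may schedule this thread onto.
    #[cfg(target_arch = "x86_64")]
    fn describe_simd_support() -> &'static str {
        if std::arch::is_x86_feature_detected!("avx512f") {
            "baseline AVX-512 (AVX-512F) available"
        } else if std::arch::is_x86_feature_detected!("avx2") {
            "AVX2 available, no AVX-512"
        } else {
            "pre-AVX2 fallback"
        }
    }

    fn main() {
        #[cfg(target_arch = "x86_64")]
        println!("{}", describe_simd_support());
    }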
|
|
| ▲ | bee_rider 19 hours ago | parent | prev [-] |
| It is kind of a bummer that MKL isn't open sourced, as that would make inclusion in Linux easier. It is already free-as-in-beer, but of course that doesn't solve everything. Baffling that MS didn't use it; they have a pretty close relationship… Agree that they are sort of going after hard-to-use niche features nowadays. But I think it is just that the real thing we want—single-threaded performance for branchy code—is, like, incredibly difficult to improve at this point. |
| |
| ▲ | PaulHoule 18 hours ago | parent [-] | | At the very least you can decode UTF-8 really quickly with AVX-512 https://lemire.me/blog/2023/08/12/transcoding-utf-8-strings-... and web browsers spend a lot of cycles decoding HTML and JavaScript, which are UTF-8 encoded. It turns out AVX-512 is good at a lot of things you wouldn't think SIMD would be good at. Intel's got the problem that people don't want to buy new computers because they don't see much benefit from a new computer, but a new computer doesn't have the benefit it could have because of lagging software support, and the software support lags because there aren't enough new computers to justify the work. Intel deserves blame for a few things, one of which is that they have dragged their feet at getting really innovative features into their products while turning people off with various empty slogans. They really do have a new instruction set that targets plain ordinary single-threaded branchy code https://www.intel.com/content/www/us/en/developer/articles/t... but they'll probably be out of business before you can use it. | | |
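This is not the transcoder from the linked post, just a toy in the same spirit to show why UTF-8 maps onto SIMD at all: every byte whose top two bits are not 10 starts a code point, so counting the code points in a chunk is one compare plus a popcount. A sketch assuming Rust nightly's portable SIMD (count_code_points is a made-up name); validation and carrying state across chunk boundaries are deliberately left out.

    #![feature(portable_simd)]
    use std::simd::prelude::*;
    use std::simd::u8x32;

    // Count UTF-8 code points in a 32-byte chunk: a byte starts a code
    // point iff it is not a continuation byte (0b10xxxxxx), so mask the
    // top two bits, compare, and popcount the resulting bitmask.
    fn count_code_points(chunk: &[u8; 32]) -> u32 {
        let v = u8x32::from_array(*chunk);
        let continuation = (v & u8x32::splat(0xC0)).simd_eq(u8x32::splat(0x80));
        32 - continuation.to_bitmask().count_ones()
    }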
| ▲ | gatane 16 hours ago | parent | next [-] | | In the end, it doesn't even matter; JavaScript frameworks are already big enough to slow down your PC. Unless said parsing optimization runs at the very core of JS. | | | |
| ▲ | immibis 8 hours ago | parent | prev [-] | | If you pay attention, this isn't a UTF-8 decoder. It might be some other encoding, or a complete misunderstanding of how UTF-8 works, or an AI hallucination. It also doesn't talk about how to handle the variable number of output bytes or the possibility of a continuation sequence split between input chunks. | | |
| ▲ | kjs3 3 hours ago | parent [-] | | I paid attention, and I don't see where Daniel claimed that this is a complete UTF-8 decoder. He's illustrating a programming technique using a simplified use case, not solving the world's problems. And I don't think Daniel Lemire lacks an understanding of the concept or needs an AI to code it. |
|
|
|