AVX-512: First Impressions on Performance and Programmability (shihab-shahriar.github.io)
58 points by shihab 5 days ago | 18 comments
nnevatie 43 minutes ago | parent | next [-]

I found this a weird article.

If you wish to see some speedups using AVX512, without limiting yourself to C or C++, you might want to try ISPC (https://ispc.github.io/index.html).

You'll get sane aliasing rules from the perspective of performance, multi-target binaries with dynamic dispatching and a lot more control over the code generated.

majke 16 minutes ago | parent [-]

ISPC looks interesting. Does it work with AMD? They hint at GPUs, I guess mostly Intel ones?

camel-cdr 5 days ago | parent | prev | next [-]

> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.
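A minimal scalar sketch of that per-lane bucket idea, in portable C++ rather than AVX-512 intrinsics. The names `accumulate_per_lane`, `BucketTotals`, `values` and `bucket_of` are hypothetical; the article's actual `sum_r` and `count` layout is not shown in this thread, so the data shapes here are assumptions. The inner loop over `lane` stands in for one 16-wide scatter-add: because every lane writes to its own copy of each bucket, two lanes can never hit the same address.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 16;  // 16 x 32-bit lanes in one 512-bit register

struct BucketTotals {
    std::vector<float> sum;   // per-bucket sum of values
    std::vector<int> count;   // per-bucket number of values
};

// Hypothetical helper: bucket_of[i] is the bucket index of values[i].
BucketTotals accumulate_per_lane(const std::vector<float>& values,
                                 const std::vector<int>& bucket_of,
                                 std::size_t num_buckets) {
    assert(values.size() == bucket_of.size());

    // One private slot per (bucket, lane) pair: index = bucket * kLanes + lane.
    std::vector<float> lane_sum(num_buckets * kLanes, 0.0f);
    std::vector<int> lane_count(num_buckets * kLanes, 0);

    const std::size_t n = values.size();
    for (std::size_t base = 0; base < n; base += kLanes) {
        // Stand-in for one vector iteration: lane k only touches slot k.
        for (std::size_t lane = 0; lane < kLanes && base + lane < n; ++lane) {
            const std::size_t b = static_cast<std::size_t>(bucket_of[base + lane]);
            lane_sum[b * kLanes + lane] += values[base + lane];
            lane_count[b * kLanes + lane] += 1;
        }
    }

    // After the loop: reduce the 16 private copies of each bucket into one.
    BucketTotals out{std::vector<float>(num_buckets, 0.0f),
                     std::vector<int>(num_buckets, 0)};
    for (std::size_t b = 0; b < num_buckets; ++b) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out.sum[b] += lane_sum[b * kLanes + lane];
            out.count[b] += lane_count[b * kLanes + lane];
        }
    }
    return out;
}
```

The trade-off is 16x the bucket storage, so this presumably only pays off while the number of buckets is small enough for the duplicated tables to stay in cache.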

shihab 5 days ago | parent [-]

Yeah N is big enough that entire data isn't in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.

pjmlp 5 days ago | parent | prev | next [-]

> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

The problem is that not all programming languages expose SIMD, and those that do expose only a portable subset. Additionally, the skills required to use SIMD properly aren't something everyone is comfortable with.

I certainly am not. I still managed to get by with MMX and early SSE, and I can manage shading languages, and that is about it.

adgjlsfhk1 3 hours ago | parent [-]

The good news is that the portable subset of SIMD is all you really need anyway. If you go beyond the portable subset, you need per-architecture code writing and testing, and you're mostly talking about pretty small gains relative to the cost.

DeathArrow 25 minutes ago | parent | prev | next [-]

>On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

There are lots of people using JavaScript frameworks to build slow desktop and mobile software.

user_7832 3 minutes ago | parent [-]

I wonder if the excess CO2 emitted by devices around the world running needlessly bloated software (hullo MS Teams) could be calculated in terms of # of transatlantic jet flights.

chillitom 4 hours ago | parent | prev | next [-]

The initial example takes array pointers without the __restrict__ keyword/extension, so the compiler has to assume they may alias the same memory and will generate defensive code.

It would be interesting to see if auto-vectorization performs better with that addition.
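For illustration, a hedged sketch of what adding the qualifier looks like. The function `scale_add` and its parameters are made up for this example and are not the article's kernel. `__restrict__` is a GCC/Clang extension in C++ (MSVC spells it `__restrict`); it is a promise from the programmer, not something the compiler checks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: out[i] = a[i] + k * b[i].
// __restrict__ promises that out, a and b never overlap, so the compiler
// does not have to assume a store to out[i] could change a later a[j] or b[j].
void scale_add(float* __restrict__ out,
               const float* __restrict__ a,
               const float* __restrict__ b,
               float k,
               std::size_t n) {
    // Debug-only, partial sanity check: it catches identical start addresses,
    // not every possible overlap. Breaking the promise is undefined behavior.
    assert(out != a && out != b);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + k * b[i];
    }
}
```

Whether this changes the generated code for a given loop depends on the compiler and flags. Without the qualifier, compilers often still vectorize but add a runtime overlap check with a scalar fallback.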

chillitom 4 hours ago | parent [-]

Also, letting the compiler know that the float* pointers are aligned would be a good move.

auto aligned_p = std::assume_aligned<16>(p); // C++20, from <memory>

magicalhippo 41 minutes ago | parent | next [-]

> let the compilers know that the float* are aligned

Reminded me of way back before OpenGL 2.0, and I was trying to get Vertex Buffer Objects working in my Delphi program using my NVIDIA graphics card. However it kept crashing occasionally, and I just couldn't figure out why.

I've forgotten a lot of the details, but either the exception message didn't make sense or I didn't understand it.

Anyway, after bashing my head for a while I had an epiphany of sorts. NVIDIA liked speed, vertices had to be manipulated before uploading to the GPU, maybe the driver used aligned SIMD instructions and relied on the default alignment of the C memory allocator?

In Delphi the default memory allocator at the time only did 4 byte aligned allocations, and so I searched and found that Microsoft's malloc indeed was default aligned to 16 bytes. However the OpenGL standard and VBO extension didn't say anything about alignment...

Manually aligned the buffers and voila, the crashes stopped. Good times.
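A small sketch of what "manually aligning" a buffer can look like when the allocator gives weaker guarantees than the consumer expects: over-allocate, then round the pointer up. This is written in C++ rather than Delphi, and `AlignedFloatBuffer` and `make_aligned_floats` are hypothetical names, not anything from the original program. `std::align` is the standard helper from `<memory>` that does the rounding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

struct AlignedFloatBuffer {
    std::unique_ptr<unsigned char[]> storage;  // owns the over-allocated bytes
    float* data = nullptr;                     // aligned view into storage
};

// Hypothetical helper: room for `count` floats, starting at an address that
// is a multiple of `alignment` (a power of two, at least alignof(float)).
AlignedFloatBuffer make_aligned_floats(std::size_t count, std::size_t alignment) {
    assert(alignment >= alignof(float));
    assert((alignment & (alignment - 1)) == 0);

    const std::size_t bytes = count * sizeof(float);
    std::size_t space = bytes + alignment;  // padding covers the worst-case shift

    AlignedFloatBuffer buf;
    buf.storage = std::make_unique<unsigned char[]>(space);

    // std::align bumps p forward to the next aligned address, or returns
    // nullptr if `bytes` would no longer fit in the remaining space.
    void* p = buf.storage.get();
    buf.data = static_cast<float*>(std::align(alignment, bytes, p, space));
    return buf;
}
```

In current C++ an aligned allocation function such as `std::aligned_alloc` or aligned `operator new` would usually be the simpler route; the over-allocate-and-round approach is shown because it works with any allocator.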

Remnant44 4 hours ago | parent | prev [-]

Which, honestly, shouldn't be necessary today with AVX-512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it _is_ aligned you will get the same performance as the aligned-only load.

No reason for the compiler to balk at vectorizing unaligned data these days.

dmpk2k 2 hours ago | parent [-]

> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput

Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...

Remnant44 an hour ago | parent [-]

There are many situations where your data is essentially _majority_ unaligned. Considerable effort by the hardware guys has gone into making that situation work well.

A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +- N element neighborhood around a pixel. By definition most of those reads will be unaligned!

A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is constrained by bandwidth over compute, I have found it to be worth doing a head/body/tail situation where you do one misaligned iteration before doing the bulk of the work in alignment, but honestly for that to be worth it you have to be working almost completely out of L1 cache which is rare... otherwise you're going to be slowed down to L2 or memory speed anyways, at which point the half rate penalty doesn't really matter.

The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.
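A rough sketch of the head/body/tail split described above, using plain scalar loops so it runs anywhere. The names `head_elements` and `sum_head_body_tail` are invented for this example. One simplification to note: the comment above handles the head with a single misaligned vector iteration, while this sketch just peels it element by element, and the inner body loop stands in for one aligned 64-byte load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorBytes = 64;                      // one AVX-512 register / cache line
constexpr std::size_t kLanes = kVectorBytes / sizeof(float);  // 16 floats

// Number of leading elements to process before data + head is 64-byte aligned.
std::size_t head_elements(const float* data, std::size_t n) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(data);
    assert(addr % alignof(float) == 0);
    const std::size_t misalign = addr % kVectorBytes;
    const std::size_t head =
        misalign == 0 ? 0 : (kVectorBytes - misalign) / sizeof(float);
    return head < n ? head : n;
}

float sum_head_body_tail(const float* data, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;

    // Head: unaligned prefix.
    const std::size_t head = head_elements(data, n);
    for (; i < head; ++i) total += data[i];

    // Body: data + i is now 64-byte aligned; each outer iteration covers
    // exactly one full vector's worth of elements.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) total += data[i + lane];
    }

    // Tail: fewer than kLanes elements left over.
    for (; i < n; ++i) total += data[i];
    return total;
}
```

As the comment says, the extra bookkeeping is only likely to be worth it when the loop is running out of L1; otherwise plain unaligned loads over the whole range are simpler.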

fithisux 5 days ago | parent | prev | next [-]

What I get from this article is that the original intent of the C language still holds true.

Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize: C gives you the tools to call assembly (or use some compiler intrinsics or even inline assembly).

Complex compiler doing crazy optimizations, in my opinion, is not worth it.

kergonath an hour ago | parent | next [-]

> Complex compiler doing crazy optimizations, in my opinion, is not worth it.

Optimisations that live in the back-end are also used by other languages, ones that are higher-level or that cannot drop to assembler as easily. C is just one of the front-ends of modern compiler suites.

eru an hour ago | parent | prev [-]

Well, C is a lie anyway: it's not how computers work any more (and I'm not sure it's how they ever worked).

ecesena 4 hours ago | parent | prev [-]

If you have the opportunity, try out a Zen 5. Significant improvements.

See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...