Validark | 3 days ago
I would be interested in more examples where "assembly is faster than intrinsics", i.e. where the compiler screws up. I generally write Zig code with the expectation of a specific sequence of instructions being emitted, and I usually get it via the high-level wrappers in std.simd plus a few LLVM intrinsics. If those fail, I'll use inline assembly to force a particular instruction. On extremely rare occasions I'll rely on auto-vectorization, if it's good and I want the code to fall back on scalar for less sophisticated CPU targets (although sometimes it's the compiler that lacks sophistication).

Aside from the glaring holes in the VPTERNLOG finder, I feel that instruction selection is generally good enough that I can get whatever I want. The bigger issues are instruction ordering and register allocation. On code where the compiler effectively has to lower serially-dependent small snippets independently, I think it does a great job. However, when it comes to massive amounts of open code, I'm shocked at how silly the decisions are that the compiler makes. I see super trivial optimizations available at a glance: things like spilling x and y to memory, only to read them both back in to do an AND, and then spill the result again.

Constant re-use is unfortunately super easy to break: often just changing the type in the IR makes a constant look different to the compiler. It also seems unable to merge a partially poisoned (undefined) constant with another constant that agrees in all the defined portions. Even when you write the code in such a way that the same constant is used twice to get around the issue, it will give you two separate constants instead.

I hope we can fix these sorts of things in compilers. This is just my experience; let me know if I left anything out.