Sesse__ 2 days ago

It was a 20-way nested loop (!), but it probably spent all (>99%) of its time in a few of the depths. Pretty sure all of the actually executed code would fit into the LSD.

Then I moved stuff into huge precalculated arrays instead, and it became intensely memory bound. :-)

menaerus 2 days ago | parent [-]

Yeah, it might be the LSD then: basically no frontend involved after the first loop iteration, and no bottleneck in the backend either.

So, what did you end up having in the code? Ugly and fast or nice and slow? :)

Sesse__ 2 days ago | parent [-]

It's essentially research code, so it's getting uglier and uglier and faster and faster :-) It has stuff like “if I remove this assert(), then Clang does something stupid and 30% of CPU time is spent stalling on this single instruction, so meh, leave it in”. It's not going to be maintained once it's done its computation job. (https://oeis.org/draft/A286874 if you're curious.)
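A minimal sketch of why an assert() can matter for code generation at all. This is not the code from the thread; `sum_blocks` and its precondition are hypothetical. In a build without NDEBUG, the assert() puts a compare-and-branch in front of the loop, and the optimizer may also assume the condition holds on the path that continues. Both can change the instructions that end up in the hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hot function, used only for illustration.
// Without NDEBUG, the assert() below compiles to a test plus a branch to a
// noreturn failure handler. On the fall-through path the compiler is allowed
// to assume n % 8 == 0, which can alter how it unrolls or vectorizes the loop.
std::uint64_t sum_blocks(const std::uint64_t* data, std::size_t n) {
    assert(n % 8 == 0);  // removing this line may change the generated loop
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Comparing the disassembly with and without the assert() line (same compiler, same flags) is probably the quickest way to see what actually changed.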

menaerus 2 days ago | parent [-]

> if I remove this assert(), then Clang does something stupid and 30% of CPU time is spent stalling on this single instruction, so meh, leave it in

Classic compiler games. Something similar happened to me just recently when I wrote micro-optimized SIMD code for a monotonically increasing integer sequence utility. It achieved about 80% of the theoretical IPC (for Skylake-X) in microbenchmarks. However, once I moved the code from the microbenchmark into production code, what I saw was surprising (or not really): the compiler merged my carefully optimized SIMD code with the surrounding code and largely nullified the optimizations I'd done.
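One common way to keep a tuned kernel from being merged into its callers is to forbid inlining. The sketch below assumes GCC or Clang, where `__attribute__((noinline))` exists; the macro name `KEEP_OUT_OF_LINE` and the scalar `is_strictly_increasing` function are hypothetical stand-ins, not the SIMD code described above. It does not guarantee identical code generation, it only stops the function body from being inlined and re-optimized together with the call site.

```cpp
#include <cstddef>
#include <cstdint>

// GCC and Clang both define __GNUC__; other compilers get an empty macro,
// so the code still builds but without the inlining barrier.
#if defined(__GNUC__)
#define KEEP_OUT_OF_LINE __attribute__((noinline))
#else
#define KEEP_OUT_OF_LINE
#endif

// Hypothetical stand-in for a hand-tuned kernel (scalar here for brevity).
// Returns true if every element is strictly greater than the previous one.
KEEP_OUT_OF_LINE
bool is_strictly_increasing(const std::uint32_t* v, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        if (v[i - 1] >= v[i]) {
            return false;
        }
    }
    return true;
}
```

The trade-off is a real call per invocation, so this only makes sense when the kernel does enough work per call to amortize it.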

Sesse__ a day ago | parent [-]

Haha, yes, autovectorization is so much in the way sometimes. I have a bunch of hard-coded AVX2/AVX512 intrinsics lying around since the compiler can do it fine on Compiler Explorer but not in context. Still, having a stall on a single 512-bit add like that suggests something very odd in the µarch. Perhaps something like “we're all out of physical registers and we're going into some kind of panic mode” that is avoided by inserting the assert() branches and slowing things down. No idea, I'm not a Zen microarchitecture expert.

Edit: I ran the code on an Intel CPU (Kaby Lake, on my laptop) and there's no slowdown when removing the assert(). So it really seems to be something Zen-specific and weird.

menaerus a day ago | parent [-]

I've come to appreciate that compilers can only do so much. In my experience auto-vectorization doesn't really shine: it leaves a lot of performance on the table, and it also messes with hand-optimized code.

> So it really seems to be something Zen-specific and weird.

It could be the number and/or type of ports. The generated code may also differ, so compiler backend differences between the target µarchs could be a factor too.