the_fall 5 days ago

It's common for compilers to generate mildly unusual code because they translate high-level code into an abstract intermediate representation, run a variety of optimization steps on that representation, and then emit machine-specific code to perform whatever the optimizations yielded. There's no constraint along the lines of "but select the most logical opcode for this task".

The claim that the code is inefficient is really not substantiated well in this blog post. Sometimes, long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory, etc.

magicalhippo 6 hours ago | parent | next [-]

> The claim that the code is inefficient is really not substantiated well in this blog post. Sometimes, long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks.

I had a case back in the 2010s where I was trying to optimize a hot loop. The loop involved an integer division by a divisor that was common to all elements, similar to a vector normalization pass. For reasons I don't recall, I couldn't get rid of the division entirely.

I saw the compiler emitted an "idiv [mem]" instruction, and I thought surely that was suboptimal. So I reproduced the assembly but changed the code slightly so I could have "idiv reg" instead. All it involved was loading the variable into an unused register before the loop and using that inside the loop.
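
In rough terms, the two variants looked like this (a C sketch using GCC/Clang inline asm purely for illustration; the real code was hand-edited compiler output, and the names here are hypothetical):

    #include <stdint.h>

    /* Illustration only: the divisor must be nonzero and the quotient must
       fit in 32 bits (so no INT32_MIN / -1). */

    /* Variant the compiler emitted: divisor read from memory each iteration. */
    static inline int32_t div_mem(int32_t x, const int32_t *divisor)
    {
        int32_t q = x, hi;
        __asm__("cltd\n\t"              /* sign-extend eax into edx:eax */
                "idivl %[d]"            /* edx:eax / mem32, quotient -> eax */
                : "+a"(q), "=&d"(hi)
                : [d] "m"(*divisor)
                : "cc");
        (void)hi;                       /* remainder unused */
        return q;
    }

    /* Hand-tweaked variant: divisor hoisted into a register before the loop. */
    static inline int32_t div_reg(int32_t x, int32_t divisor_in_reg)
    {
        int32_t q = x, hi;
        __asm__("cltd\n\t"
                "idivl %[d]"            /* edx:eax / reg32, quotient -> eax */
                : "+a"(q), "=&d"(hi)
                : [d] "r"(divisor_in_reg)
                : "cc");
        (void)hi;
        return q;
    }

Inside the loop the only difference is whether idiv takes its operand from memory or from a register.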

So I benchmarked it and much to my surprise it was a fair bit slower.

I thought it might have been due to loop target alignment, so I spent some time inserting no-ops to align things in various supposedly optimal ways, but it never got as fast. I changed my assembly back to mirror what the compiler had spit out and voila, back to the fastest speed again...

Tried to ask around, and someone suggested it had to do with some internal register load/store contention or something along those lines.

At that point I knew I was done optimizing code by writing assembly. Not my cup of tea.

meisel 3 hours ago | parent | next [-]

If you're doing enough divisions with the same divisor, it'd be faster to do what compilers do for division by a known constant: multiply by a fixed-point integer reciprocal and shift.
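
Something along these lines, as a minimal C sketch (assumes a compiler with a 128-bit integer type such as GCC/Clang's __uint128_t, 32-bit dividends, and a divisor of at least 2):

    #include <stdint.h>
    #include <stdio.h>

    /* Invariant-divisor division via a fixed-point reciprocal:
       for 2 <= d < 2^32 and any 32-bit x,  x / d == (x * m) >> 64
       with m = floor(2^64 / d) + 1. */
    static inline uint64_t reciprocal_for(uint32_t d)
    {
        return UINT64_MAX / d + 1;      /* floor(2^64/d) + 1; overflows for d == 1 */
    }

    static inline uint32_t div_by_reciprocal(uint32_t x, uint64_t m)
    {
        return (uint32_t)(((__uint128_t)x * m) >> 64);  /* one multiply + shift, no div */
    }

    int main(void)
    {
        uint32_t d = 1000;              /* divisor shared by every element */
        uint64_t m = reciprocal_for(d); /* computed once, outside the hot loop */
        for (uint32_t x = 0; x < 2000000; x++) {
            if (div_by_reciprocal(x, m) != x / d) {
                printf("mismatch at %u\n", x);
                return 1;
            }
        }
        puts("ok");
        return 0;
    }

The 32 extra bits of precision make the quotient exact for every 32-bit dividend; d == 1 has to be special-cased, and signed dividends need the slightly more involved variant compilers emit.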

magicalhippo 19 minutes ago | parent [-]

Yeah, that can work well. I have extensive fixed-point math experience from my days of coding 3D graphics on my 286 and up, but for some reason I can't recall, I didn't consider that viable in this case.

sidewndr46 5 hours ago | parent | prev [-]

Not to suggest you weren't competent, but did you consider and try to control for the possibility that your measurement itself could be the problem?

magicalhippo 5 hours ago | parent [-]

Not going to dismiss it, but I did try not to do anything stupid. I used QueryPerformanceCounter outside the loop, pinned the benchmark thread to a single core, and the array of elements it processed was fairly large, so I don't think overhead or throttling was an issue. The measurements were very consistent and repeatable.
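
The setup was roughly of this shape (a sketch with a hypothetical stand-in kernel; assumes Windows and the Win32 timing/affinity APIs):

    #include <stdint.h>
    #include <stdio.h>
    #include <windows.h>

    /* Stand-in for the real hot loop; the actual kernel was hand-written asm. */
    static int64_t sum_quotients(const int32_t *v, size_t n, int32_t d)
    {
        int64_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += v[i] / d;
        return acc;
    }

    int main(void)
    {
        enum { N = 1 << 22 };                          /* large array: ~16 MB of int32 */
        static int32_t data[N];
        for (size_t i = 0; i < N; i++)
            data[i] = (int32_t)(i * 2654435761u);

        SetThreadAffinityMask(GetCurrentThread(), 1);  /* pin to one core */

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);                  /* timing outside the loop */
        volatile int64_t sink = sum_quotients(data, N, 1000);
        QueryPerformanceCounter(&t1);

        printf("%.3f ms (sink=%lld)\n",
               1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart,
               (long long)sink);
        return 0;
    }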

sidewndr46 4 hours ago | parent [-]

Fair enough. I've only ever really found assembly-level optimization to make any degree of sense on embedded microcontrollers. In my line of work, performance optimization usually means something along the lines of "convince co-workers not to implement their own bubble sort".

magicalhippo 2 hours ago | parent [-]

Yeah, I've also come across a lot of assembly code which was faster 10 years ago, but where the compiler now beats it. So for a while now my take has been to mostly avoid asm, but if it's needed, always keep a compiled version alongside it, and always do runtime detection to select the optimal version.
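
As a sketch of that last part (assumes GCC/Clang on x86; the function names are made up, and a quick startup micro-benchmark could stand in for the feature check):

    #include <stddef.h>
    #include <stdio.h>

    static void scale_scalar(float *v, size_t n, float f)
    {
        for (size_t i = 0; i < n; i++)      /* plain compiled fallback, always available */
            v[i] *= f;
    }

    static void scale_avx2(float *v, size_t n, float f)
    {
        for (size_t i = 0; i < n; i++)      /* stand-in for the hand-tuned asm/SIMD path */
            v[i] *= f;
    }

    static void (*scale)(float *, size_t, float);

    static void select_impl(void)
    {
        __builtin_cpu_init();               /* GCC/Clang CPU-feature detection */
        scale = __builtin_cpu_supports("avx2") ? scale_avx2 : scale_scalar;
    }

    int main(void)
    {
        select_impl();                      /* pick the implementation once, at startup */
        float v[4] = {1, 2, 3, 4};
        scale(v, 4, 2.0f);
        printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);
        return 0;
    }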

DannyBee 6 hours ago | parent | prev | next [-]

It's also rarely worth being optimal in scalar code anymore, particularly at the cost of compilation speed. The exceptions are memory accesses and branches that will miss. So the writing of useless zeros is egregious, but other stuff just isn't usually worth caring about these days. It's "good enough" in an age where, even in embedded land, I can run a 48 MHz Cortex-M0 for 10 years on a battery and not worry about a few extra ANDs. I'm much more likely to hit size limitations than speed limitations.

Not to mention that for anything not super battery-limited you can get an M55 running at 800 MHz with a separate 1 GHz NPU, hardware video encoders, etc.

This is before you even move into the Rockchip/etc. space.

We really just aren't scalar-compute limited in tons of places these days. There are certainly places where we are, but 10-15 years ago missing little scalar optimizations could make a very noticeable difference in the performance of lots of apps, and now it just doesn't anymore.

rsf 5 days ago | parent | prev [-]

> The claim that the code is inefficient is really not substantiated well in this blog post.

I didn't run benchmarks, but in the case of clang writing zeros to memory (which are never used thereafter), there's no way that particular code is optimal.

For the gcc output, it seems unlikely that the three versions are all optimal, given the inconsistent strategies used. In particular, the code that sets the output value to 0 or 1 in the size = 3 version is highly unlikely to be optimal in my opinion. I'd be amazed if it is!

Your point that unintuitive code is sometimes actually optimal is well taken though :)

its_magic 4 days ago | parent [-]

Stefan Kanthak has previously noted, in these extensive investigations, that GCC's code generator is quite horrible:

https://skanthak.hier-im-netz.de/gcc.html