Both GCC and Clang generate strange/inefficient code (codingmarginalia.blogspot.com)
63 points by rsf 5 days ago | 25 comments
the_fall 5 days ago
It's common for compilers to generate mildly unusual code because they translate high-level code into an abstract intermediate representation, run a variety of optimization passes on it, and then emit machine-specific code for whatever the optimizations yielded. There's no constraint along the lines of "but select the most logical opcode for this task". The blog post doesn't really substantiate the claim that the code is inefficient. Sometimes long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory, etc.
rwmj 5 hours ago
The OP should try with -march=native so the compiler can use vector instructions.
Slightly off-topic, but I like this way to test if memory is all zeroes: https://rusty.ozlabs.org/2015/10/20/ccanmems-memeqzero-itera... (see "epiphany #2" at the bottom of the page). I really wish there was a standard libc function for it.
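For readers who can't follow the truncated link: one well-known way to do this kind of check, and as far as I can tell the idea the post is pointing at, is to verify a short prefix byte by byte and then `memcmp` the buffer against itself at an offset. A minimal sketch, where the name `is_all_zero` and the 16-byte prefix are my own assumptions rather than anything from the post:

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the "compare the buffer with itself" idea. Once the first kPrefix
// bytes are known to be zero, memcmp(p, p + kPrefix, length - kPrefix) == 0
// means every byte equals the byte kPrefix positions before it, so by
// induction the whole buffer is zero.
// is_all_zero and kPrefix are hypothetical names chosen for illustration.
bool is_all_zero(const void* data, std::size_t length) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    constexpr std::size_t kPrefix = 16;

    // Check the first kPrefix bytes (or fewer, for short buffers) one by one.
    std::size_t i = 0;
    for (; i < kPrefix && i < length; ++i) {
        if (p[i] != 0) return false;
    }
    if (i == length) return true;  // the prefix check covered the whole buffer

    // The prefix is zero; the overlapping self-compare extends that to the rest.
    return std::memcmp(p, p + kPrefix, length - kPrefix) == 0;
}
```

The appeal is that the bulk of the work lands in `memcmp`, which libc implementations usually vectorize already. Whether that beats a hand-written loop on a given machine would still need measuring.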
btdmaster 4 days ago
In my experience, C++ abstractions give the optimizer a harder job, and so it generates worse code. In this case, clang emits different code if you write a C version[0] versus the C++ original[1]. Usually an abstraction like this means the compiler has to start from generic code, which is harder to propagate constraints through, so it is less likely to end up as the same final assembly: it is less similar to the "canonical" version of the code, which wouldn't use a magic `==` (in this case), std::vector methods, or something else like that.
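To make the comparison concrete, here is a rough sketch of the two styles in question. These are hypothetical stand-ins, not the code from the article or from the [0]/[1] links; the element type, the size of 3, and the function names are all assumptions:

```cpp
#include <array>
#include <cstddef>

// Assumed size and element type, chosen only for illustration.
constexpr std::size_t kSize = 3;

// "C++ abstraction" style: compare against a value-initialized (all-zero)
// array using std::array's operator==.
bool all_zero_cpp(const std::array<int, kSize>& a) {
    return a == std::array<int, kSize>{};
}

// "C" style: an explicit loop over a raw pointer and a length.
bool all_zero_c(const int* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != 0) return false;
    }
    return true;
}
```

Both functions compute the same answer, so any difference in the emitted assembly comes from how the optimizer handles each form. Pasting both into something like Compiler Explorer at the same optimization level is probably the quickest way to see whether a given compiler version treats them differently.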
newpavlov 4 hours ago
Compilers also like to unnecessarily copy data to the stack: https://github.com/llvm/llvm-project/issues/53348
This can be particularly annoying in cryptographic code, where you want to minimize the number of copies of sensitive data.
gspr 4 hours ago
With `u32` as the element type, rustc 1.93 (with `-O`) does the correct thing for size=1, checks both elements separately (i.e. worse than in the article) for size=2, checks all three elements separately (i.e. not being crazy like in the article) for size=3, and starts doing SIMD at size=4.
hulitu 44 minutes ago
> Both GCC and Clang generate strange/inefficient code
At the same time Anthropic announces its own compiler. How sweet.
rerdavies 2 hours ago
Sure, the code is strange, but it is not necessarily inefficient. The only way to determine whether it is inefficient is to profile the generated code, and perhaps compare the performance of the compiler-generated code with tweaked or hand-written assembler that you think might be better.
GCC and Clang both have highly detailed models of processor execution pipelines that are used to perform optimization and instruction scheduling. This allows them to perform optimizations that mere mortals can only do with assistance from tools like Intel VTune, which provides insight into how execution pipelines are running, and where and when they are stalling.
It's entirely possible that the multiple memory fetches may fuse in the execution pipeline, and that seemingly unnecessary instructions may dual-issue and execute in parallel. Or that minor variations in generated instructions may allow four instructions to be decoded in parallel instead of three at a critical moment in the code. These are the kinds of insights that GCC and Clang have into how the code will actually execute that you do not.
Both GCC and Clang have highly detailed models of the processor's execution pipeline for literally hundreds of processors. These models allow the compiler to determine which instructions execute in parallel, and to predict and avoid stalls at various places in the processor's execution pipeline.
Counter-intuitively, many code optimization problems depend neither on which instructions are being executed nor even on how many, since pretty much every instruction in passably well-optimized code will execute in parallel with at least one other instruction. The actual problem becomes one of predicting whether any operation in a 7- to 20-stage execution pipeline will stall, and whether there are ways to schedule instructions so that the stalls either don't occur at all or don't matter.
Optimizations that are dependent on memory access are particularly perilous. Modern processors have elaborate and sometimes unpredictable methods for optimizing memory accesses: not just cache optimizations, but also fusing of reads and writes, optimizations for streaming reads and writes, single-cycle reads and writes for memory operations that look like they are stack-related, strategies for scheduling reads and writes to avoid bus-turnaround time, and probably others. Very often, the only thing that matters is the memory access stalls, with all other instructions operating in parallel in the time that it takes for the memory reads and writes to complete. (Does your processor have handling in the execution pipeline that prevents a potentially expensive branch misprediction in that tight code loop? I don't know. But GCC and Clang do!)
For a human to compete with GCC or Clang code, intuition about how code executes isn't sufficient. If you are not using sophisticated profiling tools like Intel VTune, you really won't have insight into whether your hand-written assembler is stalling in the execution pipeline. And that is typically the problem that determines how well code executes.
How the data must flow is invariant from input to output. In this case, the input array must be read, and a register must be set to zero or one on output. And both compilers and processor execution pipelines are capable of doing quite extraordinary things to maximize opportunities for parallel execution and pipelining.
So. The ONLY way to tell whether any of that generated code is inefficient is to benchmark it. Intuition is not remotely sufficient. As far as I can see, both compilers have done quite heroic and spectacular jobs of optimizing the code. It is not at all clear whether the compilers know something about how memory operations fuse in the instruction pipeline that you don't.
The only oddity is the extra memory write to initialize the zero array that shows up in a single case, which, in fairness, occurs because you have introduced a faux optimization in the original code. One of the compilers heroically (and probably correctly) optimized the bulk of the code, and tragically missed an opportunity to remove a faux optimization that YOU have introduced. Even then, it's still not clear that an extra memory write is going to execute slower. (A write to L0 cache (either one or two CPU clock cycles), followed by a bunch of reads from L0 cache: does the cache controller allow parallel reads and writes or does it not? I don't know, but GCC and Clang do!) Obviously not a good thing, but does it ACTUALLY impact performance? I don't know. And the only way to tell is for you to actually profile the code.
Also worth mentioning in passing: if you are not compiling with -march=native, all your code is being compiled for a least-common-denominator baseline that predates the processor you actually have (on x86-64, by default the original early-2000s instruction set, with none of the newer vector extensions). So make sure you are.
- Credentials: professional programmer with 45 years of experience, including extensive experience optimizing and profiling high-performance graphics device drivers and audio plugin code, some of which was done in the era where humans actually could speed up compiler-generated code by (typically) 2 or 3%, in an industry where 2 or 3% improvements in benchmark scores could increase profits by millions of dollars. Currently of the opinion that any optimization that produces less than a 25% performance improvement is just not worth the extra effort and risk.
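As a starting point for the "benchmark it" advice, a crude timing harness might look something like the sketch below. Everything in it is an assumption for illustration: the `Elem` type, the two candidate functions, and the `ns_per_call` helper are hypothetical, and this is no substitute for a real profiler such as VTune.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed element type and size; not taken from the article.
constexpr std::size_t kSize = 3;
using Elem = std::array<std::uint32_t, kSize>;

// Candidate 1: compare against a value-initialized (all-zero) array.
bool is_zero_eq(const Elem& a) { return a == Elem{}; }

// Candidate 2: explicit element-by-element loop.
bool is_zero_loop(const Elem& a) {
    for (std::uint32_t v : a) {
        if (v != 0) return false;
    }
    return true;
}

// Average wall-clock nanoseconds per call of f over `repeats` passes of
// `inputs`. Results go through a volatile sink so the calls are not
// optimized away. Assumes inputs is non-empty and repeats > 0.
template <typename F>
double ns_per_call(F f, const std::vector<Elem>& inputs, int repeats) {
    volatile std::size_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (const Elem& e : inputs) {
            sink = sink + (f(e) ? 1u : 0u);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (static_cast<double>(repeats) * inputs.size());
}
```

Calling `ns_per_call(is_zero_eq, inputs, 1000)` and `ns_per_call(is_zero_loop, inputs, 1000)` on the same input vector gives two numbers to compare. A loop like this mostly measures throughput with warm caches and well-predicted branches, so it can hide exactly the stalls discussed above; treat it as a first sanity check, not a verdict.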