rerdavies · 5 hours ago
Sure, the code is strange, but it is not necessarily inefficient. The only way to determine whether it is inefficient is to profile the generated code, and perhaps to compare the performance of the compiler-generated code with the tweaked or hand-written assembler that you think might be better.

GCC and Clang both have highly detailed models of processor execution pipelines, for literally hundreds of processors, which they use to perform optimization and instruction scheduling. These models let the compiler determine which instructions execute in parallel, and predict and avoid stalls at various places in the pipeline. That allows them to perform optimizations that mere mortals can only attempt with assistance from tools like Intel VTune, which provides insight into how execution pipelines are running, and where and when they are stalling. It's entirely possible that the multiple memory fetches fuse in the execution pipeline, that seemingly unnecessary instructions dual-issue and execute in parallel, or that minor variations in the generated instructions allow four instructions to be decoded in parallel instead of three at a critical moment in the code. These are the kinds of insights that GCC and Clang have into how the code will actually execute that you do not.

Counter-intuitively, many code optimization problems depend not on which instructions are executed, and not even on how many instructions are executed, since pretty much every instruction in passably well-optimized code executes in parallel with at least one other instruction. The actual problem is predicting whether any operation in a 7- to 20-stage execution pipeline will stall, and whether there are ways to schedule instructions so that the stalls either don't occur at all, or don't matter at all.

Optimizations that depend on memory access are particularly perilous. Modern processors have elaborate and sometimes unpredictable methods for optimizing memory accesses: not just cache optimizations, but also fusing of reads and writes, optimizations for streaming reads and writes, single-cycle reads and writes for memory operations that look stack-related, strategies for scheduling reads and writes to avoid bus-turnaround time, and probably others. Very often, the only thing that matters is the memory-access stalls, with all other instructions executing in parallel in the time it takes the memory reads and writes to complete. (Does your processor have handling in the execution pipeline that prevents a potentially expensive branch misprediction in that tight code loop? I don't know. But GCC and Clang do!)

For a human to compete with GCC or Clang, intuition about how code executes isn't sufficient. If you are not using sophisticated profiling tools like Intel VTune, you really won't have insight into whether your hand-written assembler is stalling in the execution pipeline, and that is typically the problem that determines how well code executes. (A couple of concrete commands are sketched at the end of this comment.)

How the data must flow is invariant from input to output: in this case, the input array must be read, and a register must be set to zero or one on output. Within that constraint, both compilers and processor execution pipelines are capable of doing quite extraordinary things to maximize opportunities for parallel execution and pipelining. So.
The ONLY way to tell whether any of that generated code is inefficient is to benchmark it. Intuition is not remotely sufficient. As far as I can see, both compilers have done quite heroic and spectacular jobs of optimizing the code. It is entirely possible that the compilers know something about how memory operations fuse in the instruction pipeline that you don't.

The only oddity is the extra memory write to initialize the zero array, which shows up in a single case, and which, in fairness, occurs because you introduced a faux optimization in the original code. One of the compilers heroically (and probably correctly) optimized the bulk of the code, and tragically missed an opportunity to remove a faux optimization that YOU introduced. Even then, it's still not clear that the extra memory write executes slower. A write to L1 cache (one or two CPU clock cycles), followed by a bunch of reads from L1 cache: does the cache controller allow the reads and the write to proceed in parallel, or does it not? I don't know, but GCC and Clang do! Obviously the extra write is not a good thing, but does it ACTUALLY impact performance? I don't know. And the only way to tell is for you to actually profile the code.

Also worth mentioning in passing: if you are not compiling with -march=native, all your code is being optimized for some prehistoric least-common-denominator Intel processor, probably a 1990s-era 486 that nobody actually has anymore, with god-only-knows-what inadequacies in its execution pipeline. So make sure you are.

- Credentials: professional programmer with 45 years of experience, including extensive experience optimizing and profiling high-performance graphics device drivers and audio plugin code, some of it done in the era when humans actually could speed up compiler-generated code by (typically) 2 or 3%, in an industry where 2 or 3% improvements in benchmark scores could increase profits by millions of dollars. Currently of the opinion that any optimization that produces less than a 25% performance improvement is just not worth the extra effort and risk.
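P.S. For concreteness, the sort of build and measurement commands I mean. These are illustrative, not from the article: isodd.cpp and isodd.s are placeholder names, perf is Linux-specific, and exact flags and event names vary by toolchain and CPU (the stalled-cycles events are not exposed on every processor).

    # Build for the machine you actually have, not the baseline ISA:
    g++ -O3 -march=native isodd.cpp -o isodd

    # Count pipeline stalls instead of guessing about them:
    perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./isodd

    # Or ask LLVM's own pipeline model how the code schedules
    # (isodd.s generated with, e.g., clang++ -S -O3 -march=native isodd.cpp):
    llvm-mca -mcpu=skylake -timeline isodd.s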
rsf · 3 hours ago
> Sure, the code is strange, but it is not necessarily inefficient.

Out of the 6 pieces of assembly code in the article, 2 of them are definitely inefficient: specifically, the 2 clang ones that contain irrelevant writes to the stack. Even if a CPU were smart enough to ignore those instructions with no performance penalty (which is in itself doubtful), at the very least those instructions take up space in memory and in the caches unnecessarily. The gcc output when arraySize is 3 is almost certainly inefficient as well, when you look at portions such as:
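[Illustrative stand-in; the article's actual gcc output is not reproduced here. Per the description below, it is a longer, flag-clobbering sequence whose only net effect is eax = 0:]

    cmp     edx, ecx          ; sets flags that nothing ever reads
    mov     eax, 0            ; the mov mentioned below
    ret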
All this code does is set eax to 0 and then return. It could be replaced with simply "xor eax, eax ; ret", or with "mov eax, 0 ; ret" if there's a reason to avoid "xor" (there's already a mov there); both are spelled out in the sketch at the end of this comment. The code as present also has the side effect of changing the CPU's flags, but this side effect can't be relied on, as we return immediately and flag values are not part of the returned values with this ABI.

So yes, in general benchmarking is the only way to be sure. But when you look at the specifics of the generated code, we can see that at best 4 of the 6 snippets of assembly code are optimal, and the actual number of optimal snippets is probably lower than 4 (my best guess is 2 here). All that said, I might benchmark everything later on and post a new article about it.

> Also worth mentioning in passing: if you are not compiling with -march=native, all your code is being optimized for some prehistoric least-common-denominator Intel processor, probably a 1990s-era 486 that nobody actually has anymore, with god-only-knows-what inadequacies in its execution pipeline. So make sure you are.
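[The sketch referenced above, for concreteness -- Intel syntax, illustrative:]

    xor     eax, eax        ; eax = 0; clobbers flags, but flags are
                            ; dead across the return under this ABI
    ret

    ; or, if xor must be avoided:
    mov     eax, 0          ; longer encoding; mov does not touch flags
    ret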