▲ | holowoodman 4 days ago | |||||||||||||
>> LLVM is still slower than GCC, even in plain C. > Citation needed. Last time I checked LLVM beat GCC on -O3. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... https://www.phoronix.com/review/gcc-clang-eoy2023/8 There is tons more, just ask your favourite internet search engine. > Why should two nearly idenfical operations have such wildly different performance? And why isn't the safer/saner interface more performant? Because in hardware, checking a zero-flag on a register is very cheap. Big NAND over all bits of a sub-register, big OR over all those flags for the whole SIMD register. Very short path, very few transistors. Checking a length is expensive: Increment a length register through addition, including a carry, which is a long path. Compare with another register to out if you are at the last iteration. Compare usually is also expensive since it is subtract with carry internally, even though you could get away with xor + zero flag. But you can't get around the length register addition. There can be an optimisation because you do have to increment the address you fetch from in any case. But in most modern CPUs, that happens in special pointer registers internally, if you do additional arithmetics on those it'll be expensive again. x86 actually has more complex addressing modes that handle those, but compilers never really used them, so those were dropped for SIMD. | ||||||||||||||
▲ | jcranmer 4 days ago | parent | next [-] | |||||||||||||
A loop that's basically:
is going to compile down to something like:
(Okay, to be fair, compilers tend to change the loop bounds so that the final comparison is a comparison against 0. This will reduce the register usage of the loop, but otherwise doesn't change the analysis). Taking into account micro-op fusion, there is at best two actual hardware ops in that assembly pattern, since the comparison and branch will be fused, and the increment(/decrement) is also a decent candidate for fusion. The branch can also be predicted with a high degree of fidelity--an advanced processor should be able to predict the branch with exactly 100% accuracy.A null-terminated loop instead compiles down to:
You still have the some basic pattern at the end of the loop, except that the INC (which might not have been previously fused) is no longer in the critical path, but now the critical path instead has a load in it. That load is a much slower instruction than a test-against-0 or a comparison: no load completes in one cycle, whereas both tests (bitwise and) and comparisons (subtraction) complete in one cycle. Splitting hairs by saying that the test-against-0 is going to be less expensive than a subtract-and-check-carry, so therefore null-terminated is faster ignores the fact that they execute in the same time and brushes the extra load, which is definitely slower, that exists in the null-terminated string.It goes worse, though. When you have a pointer+length combo, the length of the array, and the legal dereferenceability, is now known to the compiler in symbolic form that is easy to compute. That makes it possible to do things like unroll loops, or vectorize loops, or rewrite loop bounds in better forms, etc. When it's null-terminated, the length comes from reading memory, and the amount of memory you can read is not known ahead of time. Unrolling the loop or other steps are no longer really possible since you don't know how many iterations the loop will take, and the extra loop exits you'd generate tend not to be tolerable in optimization. | ||||||||||||||
| ||||||||||||||
▲ | miloignis 4 days ago | parent | prev [-] | |||||||||||||
Your Phoronix link shows clang as faster than GCC (more is better on the graph). The benchmark game results seem to go back and forth. I don't think that it's convincing evidence that clang is slower than GCC. | ||||||||||||||
|