Ygg2 4 days ago

> Compilers for higher-level languages are full of broken promises

Sometimes that's because C is the lingua franca of low-level programming.

Some noalias optimizations Rust relied on didn't work in LLVM, because no one had bothered using them from C before.

This goes even further, down to the hardware. C-style null-terminated string search with SIMD is faster than search over a saner (pointer + len) string view. So it's faster to append a null to the end of a string view and invoke the SIMD routine than to search the string slice directly.

holowoodman 4 days ago | parent [-]

And here we go again with the compiler excuses.

C standards with the 'restrict' keyword, which permits no-aliasing optimisations, have existed for longer than Rust. LLVM just never bothered, despite the early claims that the intermediate language and a "more modern" compiler architecture would enable more optimisation potential. LLVM is still slower than GCC, even in plain C.

Where is the problem in quickly null-terminating a pointer+len string and then using the quick null-terminated SIMD search? It should only be one move slower than native C. "Space after the string" shouldn't be a problem, since in higher-level languages the allocator can leave enough room, it is higher-level after all (and you need null-termination for syscalls anyway).

It is always the same story. Compiler and programming language people make claims about future optimisations but never ever follow through. It's always just theory, never implementation.

chrismorgan 4 days ago | parent | next [-]

LLVM did do some work with noalias, but it wasn't very mature, because it was too hard to use correctly to be broadly useful in C/C++.

Then Rust made it trivial to use correctly, and found that what LLVM did have was quite buggy, because it hadn’t been exercised. These were bugs that could in theory have been found in C/C++, but in practice never would have been. And so for quite some years Rust would toggle the noalias optimisations on and off, depending on whether there were any known bad-codegen issues at the time. I think the last toggle was a few years ago by now, this stuff is finally actually stable and appreciated.

My recollection is that the figures for the Rust compiler are in the vicinity of a 5% improvement from emitting noalias in the LLVM IR.

And it’s likely that quite a bit more is possible.

Ygg2 4 days ago | parent | prev [-]

> LLVM just never bothered, despite the early claims that all the intermediate language and "more modern" compiler would enable more optimisation potential.

They did bother, but no one seemed to be using it, so a bug snuck in. It took Rust exercising that corner of LLVM to find the bug.

> LLVM is still slower than GCC, even in plain C.

Citation needed. Last time I checked LLVM beat GCC on -O3. Unless you mean compilation performance.

> Where is the problem to quickly null-terminate a pointer+len string and use the quick null-terminated SIMD search?

Why should two nearly identical operations have such wildly different performance? And why isn't the safer/saner interface more performant?

My point is: if compilers and hardware are optimizing for C, it's no surprise no one can approach its speed.

It's like asking why when everyone is optimizing for Chrome no other web browser can approach its speed.

holowoodman 4 days ago | parent [-]

>> LLVM is still slower than GCC, even in plain C.

> Citation needed. Last time I checked LLVM beat GCC on -O3.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

https://www.phoronix.com/review/gcc-clang-eoy2023/8

There is tons more, just ask your favourite internet search engine.

> Why should two nearly identical operations have such wildly different performance? And why isn't the safer/saner interface more performant?

Because in hardware, checking a zero-flag on a register is very cheap. Big NAND over all bits of a sub-register, big OR over all those flags for the whole SIMD register. Very short path, very few transistors.

Checking a length is expensive: you increment a length register through an addition, including a carry, which is a long path. Then you compare with another register to find out whether you are at the last iteration. The compare is usually also expensive, since internally it is a subtract-with-carry, even though you could get away with xor + zero flag. But you can't get around the length-register addition.

There is room for an optimisation, because you have to increment the fetch address in any case. But in most modern CPUs that happens in special pointer registers internally; if you do additional arithmetic on those, it gets expensive again. x86 actually has more complex addressing modes that could handle this, but compilers never really used them, so they were dropped for SIMD.

jcranmer 4 days ago | parent | next [-]

A loop that's basically:

  for (int i = 0; i < len; i++)
    /* do something */
is going to compile down to something like:

  label: ; blah blah blah
    INC r0
    CMP r0, r1
    JNZ label
(Okay, to be fair, compilers tend to change the loop bounds so that the final comparison is a comparison against 0. This reduces the register usage of the loop, but otherwise doesn't change the analysis.) Taking micro-op fusion into account, there are at best two actual hardware ops in that assembly pattern, since the comparison and branch will be fused, and the increment(/decrement) is also a decent candidate for fusion. The branch can also be predicted with a high degree of fidelity--an advanced processor should be able to predict the branch with exactly 100% accuracy.

A null-terminated loop instead compiles down to:

  label: ; blah blah blah
  LD r0, r1
  INC r1
  CMP r0, 0
  JNZ label
You still have the same basic pattern at the end of the loop, except that the INC (which might not previously have been fused) is no longer on the critical path; instead, the critical path now contains a load. That load is a much slower instruction than a test-against-0 or a comparison: no load completes in one cycle, whereas both tests (bitwise AND) and comparisons (subtraction) complete in one cycle. Splitting hairs by saying that the test-against-0 is cheaper than a subtract-and-check-carry, and that therefore null-terminated is faster, ignores the fact that they execute in the same time, and brushes aside the extra load, which is definitely slower, in the null-terminated version.

It gets worse, though. When you have a pointer+length combo, the length of the array, and hence the legal dereferenceable range, is known to the compiler in a symbolic form that is easy to compute with. That makes it possible to unroll loops, vectorize loops, rewrite loop bounds into better forms, etc. When the string is null-terminated, the length comes from reading memory, and the amount of memory you may read is not known ahead of time. Unrolling or vectorizing the loop is no longer really possible, since you don't know how many iterations the loop will take, and the extra loop exits you'd have to generate tend not to be tolerable in optimization.

aengelke 4 days ago | parent | next [-]

Storing the string length explicitly as an 8-byte integer does have a measurable cost. Consider llvm::Twine as an example, it supports storing a null-terminated string and a ptr+len string (among other options). I once changed the implementation to store string literals (length known at compile-time) as ptr+len instead of a pointer to a C string, with the intention of avoiding the strlen in the callee on constant strings. However, this change made things slower, because of the cost of storing the length everywhere. (That's why I never proposed such a change upstream.)

The critical (data) path of the null-terminated loop, however, does not include the load -- the actually loaded value is not a loop-carried dependency in your example. The re-steering of the branch at the end of the loop might happen much later, however.

Vectorization with null-terminated strings is possible and done, but requires alignment checking, which adds some cost.

holowoodman 4 days ago | parent | prev [-]

No. First, that isn't SIMD code. Second, CMP r0, 0 is redundant: an operation that leaves zero in a register will set the zero flag automatically. You also skipped the string search/comparison part, but that isn't relevant, except that the operation will implicitly set the zero flag. Generally, a plain JZ is faster than CMP + JZ. Even more so with SIMD.

And loads are amortized by tons of speculative preloading. Which isn't (yet, unfortunately) optimized by preload-length hints.

Oh, and since we are talking about searching strings, there is no extra load. This isn't strlen(); it's strchr() or strstr() we are talking about. You always have to load the string into some register anyway.

miloignis 4 days ago | parent | prev [-]

Your Phoronix link shows clang as faster than GCC (more is better on the graph). The benchmark game results seem to go back and forth.

I don't think that it's convincing evidence that clang is slower than GCC.

holowoodman 4 days ago | parent [-]

Yes, you are right. I think they are head-to-head nowadays.

I was thinking of, and meant to post, this one, but got the wrong search result:

https://www.phoronix.com/review/clang20-gcc15-amd-znver5/5

But they swapped places a few times over the last years, so basically there doesn't seem to be a fundamental difference in performance anymore.