>> LLVM is still slower than GCC, even in plain C.

> Citation needed. Last time I checked LLVM beat GCC on -O3.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

https://www.phoronix.com/review/gcc-clang-eoy2023/8

There is tons more, just ask your favourite internet search engine.

> Why should two nearly idenfical operations have such wildly different performance? And why isn't the safer/saner interface more performant?

Because in hardware, checking a zero-flag on a register is very cheap. Big NAND over all bits of a sub-register, big OR over all those flags for the whole SIMD register. Very short path, very few transistors.

Checking a length is expensive: Increment a length register through addition, including a carry, which is a long path. Compare with another register to out if you are at the last iteration. Compare usually is also expensive since it is subtract with carry internally, even though you could get away with xor + zero flag. But you can't get around the length register addition.

There can be an optimisation because you do have to increment the address you fetch from in any case. But in most modern CPUs, that happens in special pointer registers internally, if you do additional arithmetics on those it'll be expensive again. x86 actually has more complex addressing modes that handle those, but compilers never really used them, so those were dropped for SIMD.

▲ jcranmer 4 days ago | parent | next [-]

A loop that's basically:

  for (int i = 0; i < len; i++)
    /* do something */

is going to compile down to something like:

  label: ; blah blah blah
    INC r0
    CMP r0, r1
    JNZ label

(Okay, to be fair, compilers tend to change the loop bounds so that the final comparison is a comparison against 0. This will reduce the register usage of the loop, but otherwise doesn't change the analysis). Taking into account micro-op fusion, there is at best two actual hardware ops in that assembly pattern, since the comparison and branch will be fused, and the increment(/decrement) is also a decent candidate for fusion. The branch can also be predicted with a high degree of fidelity--an advanced processor should be able to predict the branch with exactly 100% accuracy.

A null-terminated loop instead compiles down to:

  label: ; blah blah blah
  LD r0, r1
  INC r1
  CMP r0, 0
  JNZ label

You still have the some basic pattern at the end of the loop, except that the INC (which might not have been previously fused) is no longer in the critical path, but now the critical path instead has a load in it. That load is a much slower instruction than a test-against-0 or a comparison: no load completes in one cycle, whereas both tests (bitwise and) and comparisons (subtraction) complete in one cycle. Splitting hairs by saying that the test-against-0 is going to be less expensive than a subtract-and-check-carry, so therefore null-terminated is faster ignores the fact that they execute in the same time and brushes the extra load, which is definitely slower, that exists in the null-terminated string.

It goes worse, though. When you have a pointer+length combo, the length of the array, and the legal dereferenceability, is now known to the compiler in symbolic form that is easy to compute. That makes it possible to do things like unroll loops, or vectorize loops, or rewrite loop bounds in better forms, etc. When it's null-terminated, the length comes from reading memory, and the amount of memory you can read is not known ahead of time. Unrolling the loop or other steps are no longer really possible since you don't know how many iterations the loop will take, and the extra loop exits you'd generate tend not to be tolerable in optimization.

	▲	aengelke 4 days ago \| parent \| next [-]
		Storing the string length explicitly as an 8-byte integer does have a measurable cost. Consider llvm::Twine as an example, it supports storing a null-terminated string and a ptr+len string (among other options). I once changed the implementation to store string literals (length known at compile-time) as ptr+len instead of a pointer to a C string, with the intention of avoiding the strlen in the callee on constant strings. However, this change made things slower, because of the cost of storing the length everywhere. (That's why I never proposed such a change upstream.) The critical (data) path of the null-terminated loop, however, does not include the load -- the actually loaded value is not a loop-carried dependency in your example. The re-steering of the branch at the end of the loop might happen much later, however. Vectorization with null-terminated strings is possible and done, but requires alignment checking, which adds some cost.
	▲	holowoodman 4 days ago \| parent \| prev [-]
		No. First that isn't SIMD code. Second, CMP r0,0 is redundant, operating on a register that is zero will set the zero flag automatically. You also skipped the string search/comparison part, but that isn't relevant, except that the operation will implicitly set the zero flag. Generally, plain JZ is faster than CMP + JZ. Even more so if it is SIMD. And loads are amortized by tons of speculative preloading. Which isn't (yet, unfortunately) optimized by preload-length hints. Oh, and since we are talking about comparing strings, there is no extra load. This isn't strlen(), this is strchr() or strstr() we are talking about. You always have to load the string into some register.

▲ miloignis 4 days ago | parent | prev [-]

Your Phoronix link shows clang as faster than GCC (more is better on the graph). The benchmark game results seem to go back and forth.

I don't think that it's convincing evidence that clang is slower than GCC.

	▲	holowoodman 4 days ago \| parent [-]
		Yes, you are right. I think they are head-to-head nowadays. I was thinking about, and meaning to post that one, but got the wrong search result: https://www.phoronix.com/review/clang20-gcc15-amd-znver5/5 But they swapped places a few times over the last years, so basically there doesn't seem to be a fundamental difference in performance anymore.