ncruces 2 days ago

Yeah, using LLVM for anything trying to avoid UB is crazy.

I got involved in a discussion with a Rust guy when trying to get C with SIMD intrinsics into wasi-libc, where something that the C standard explicitly states is “implementation defined” (and so, sane, as we're targeting a single implementation - LLVM) can't be trusted, because LLVM may turn it back into UB because “reasons.”

At this point Go and Zig made the right choice to dump it. I don't know about Rust.

https://github.com/WebAssembly/wasi-libc/pull/593

AndyKelley 2 days ago | parent [-]

It sounds like you have a fundamental misunderstanding about undefined behavior. It's easy to emit LLVM IR that avoids undefined behavior. The language reference makes it quite clear what constitutes undefined behavior and what does not.

The issue is that frontends want to emit code that is as optimizeable as possible, so they opt into the complexity of specifying additional constraints, attributes, and guarantees, each of which risks triggering undefined behavior if the frontend has a bug and emits wrong information.

ncruces 2 days ago | parent [-]

Hi Andy. Did you read the linked thread?

I was not the one making this claim:

> However, I believe that currently, there is no well-defined way to actually achieve this on the LLVM IR level. Using plain loads for this is UB (even if it may usually work out in practice, and I'm sure plenty of C code just does that).

My claim is that the below snippet is implementation defined (not UB):

  // Casting through uintptr_t makes this implementation-defined,
  // rather than undefined behavior.
  uintptr_t align = (uintptr_t)s % sizeof(v128_t);
  const v128_t *v = (v128_t *)((uintptr_t)s - align);

Further, that this is actually defined by the implementation to do the correct thing, by any good faith reading of the standard:

> The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
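A minimal sketch of the integer arithmetic in the snippet above, for readers who want to see it in isolation. The 16-byte `BLOCK_SIZE` stands in for `sizeof(v128_t)`, and `misalignment` and `block_start` are hypothetical helper names, not anything from wasi-libc. The sketch only computes integers; it never converts the result back to a pointer or loads through it, which is the step the thread is actually arguing about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for sizeof(v128_t), i.e. a 16-byte SIMD vector. */
#define BLOCK_SIZE 16u

/* Hypothetical helper: how many bytes p lies past the previous
   BLOCK_SIZE boundary, using the implementation-defined
   pointer-to-integer mapping. */
static size_t misalignment(const void *p) {
    assert(p != NULL);
    return (size_t)((uintptr_t)p % BLOCK_SIZE);
}

/* Hypothetical helper: integer address of the aligned block that
   contains p. This is only an integer; nothing is dereferenced. */
static uintptr_t block_start(const void *p) {
    return (uintptr_t)p - misalignment(p);
}
```

Whether a load through a pointer rebuilt from `block_start` is well defined is a separate question, and it is the one the rest of the thread disagrees on.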

I further suggested laundering the pointer with something like the below, but was told it would amount to nothing, again the blame being put on LLVM:

  asm ("" : "+r"(v))

I honestly don't know if LLVM or clang should be to blame. I was told LLVM IR and took it in good faith.

AndyKelley 2 days ago | parent [-]

No, I hadn't read the linked thread until you prodded me. Now I have and I understand the situation entirely. I'll give a brief overview; feel free to ask any followup questions.

A straightforward implementation of memchr, i.e. finding the index of a particular byte inside an array of bytes, looks like this:

    for (bytes, 0..) |byte, i| {
        if (byte == search) return i;
    }
    return null;

This is trivial to lower to well-defined LLVM IR.
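For comparison, here is roughly the same byte-at-a-time loop in C, since the linked PR concerns C code. This is a sketch, not the wasi-libc implementation; `simple_memchr` is a hypothetical name, and it returns a pointer the way the standard `memchr` does instead of an index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: scans exactly n bytes and never reads
   outside the range [s, s + n). */
static const void *simple_memchr(const void *s, int c, size_t n) {
    assert(s != NULL || n == 0);
    const unsigned char *p = s;
    unsigned char target = (unsigned char)c;
    for (size_t i = 0; i < n; i++) {
        if (p[i] == target) {
            return p + i; /* first match */
        }
    }
    return NULL; /* no match within n bytes */
}
```

Every access stays inside the caller's array, so there is nothing here for a provenance-based optimization to object to.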

But it's desirable to use tricks to make the function really fast, such as assuming that you can read up to the page boundary with SIMD instructions[1]. This is generally true on real world hardware, but this is incompatible with the pointer provenance memory model, which is load-bearing for important optimizations that C, C++, Rust, and Zig all rely on.
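The hardware half of that argument can be checked with plain integer arithmetic. The sketch below assumes 4096-byte pages and 16-byte vectors (both are assumptions, not values from the thread), and the function names are hypothetical. It shows that an aligned 16-byte block always sits inside the same page as the byte it was derived from, while an unaligned 16-byte read may straddle two pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE  4096u /* assumption: typical page size */
#define BLOCK_SIZE 16u   /* assumption: 128-bit SIMD vector */

/* True if both byte addresses fall in the same page. */
static bool same_page(uintptr_t first, uintptr_t last) {
    return first / PAGE_SIZE == last / PAGE_SIZE;
}

/* True if the aligned block containing addr lies entirely within
   the page that addr itself is in. */
static bool aligned_block_stays_in_page(uintptr_t addr) {
    /* The argument relies on the page size being a multiple of
       the block size. */
    assert(PAGE_SIZE % BLOCK_SIZE == 0);
    uintptr_t start = addr - addr % BLOCK_SIZE;
    uintptr_t end = start + BLOCK_SIZE - 1;
    return same_page(start, end) && same_page(addr, start);
}
```

This only says the machine will not fault on such a read. It says nothing about whether the language's memory model permits it, which is the conflict described next.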

So if you want to do such tricks you have to do it in a black box that is exempt from the memory model rules. The Zig code I link to here is unsound because it does not do this. An optimization pass, whether it be implemented in the Zig pipeline or the LLVM pipeline, would be able to prove that it reads outside a pointer's provenance, mark that particular control flow unreachable, and thereby cause undefined behavior if it happens.

This is not really LLVM's fault. This is a language shortcoming in C, C++, Rust, Zig, and probably many others. It's a fundamental conflict between the utility of pointer provenance rules, and the utility of ignoring that crap and just doing what you know the machine allows you to do.

[1]: https://github.com/ziglang/zig/blob/0.14.1/lib/std/mem.zig#L...

ncruces 2 days ago | parent [-]

Thanks for taking the time!

I was the original contributor of the SIMD code, and got this… pushback.

I still don't quite understand how you can marry “pointer provenance” with the intent that converting between pointers and integers is “to be consistent with the addressing structure of the execution environment” and want to allow DMA in your language, but then this is UB.

But well, a workable version of it got submitted, I've made subsequent contributions (memchr, strchr, str[c]spn…), all good.

Just makes me salty on C, as if I needed more reasons to.

AndyKelley a day ago | parent [-]

That's totally fair to be salty about a legitimately annoying situation. But I think it's actually an interesting, fundamental complexity of computer science, as opposed to some accidental complexity that LLVM is bringing to the table.