| ▲ | tw061023 2 days ago |
| LLVM is basically a resource pool for C++ compiler development. As such, it is highly C++ specific and leaks C++ semantics everywhere. It's especially funny when this happens in Rust, which is marketed as a "safer" alternative. Would you like a segfault out of nowhere in safe Rust? The issue is still open after two years by the way: https://github.com/rust-lang/rust/issues/107975 |
|
| ▲ | saghm 2 days ago | parent | next [-] |
| It's not clear to me what you mean to imply with regards to that issue. As far as I can tell, there's not really any indication that this is undefined behavior. Yes, there seems to be a bug of some sort in the code being generated, but it seems like a stretch to me to imply that any bug that generates incorrect code is necessarily a risk of UB. Maybe I'm missing some assumption being made about what the pointers not being equal implies, but given that you can't actually dereference `*const T` in safe Rust, I don't see how having two of them incorrectly not compare as equal could lead to unsafety. |
| |
| ▲ | tux3 2 days ago | parent [-] | | If you read the GitHub issue, this one was weaponized fairly straightforwardly by taking the difference between the two pointers. The difference is zero, but the compiler thinks it is non-zero because it thinks they are unequal. From there you turn it into type confusion through an array, and then whatever you want. Almost any wrong compiler assumption can be exploited. This particular technique has also been used several times to exploit bugs in JavaScript engines. |
|
|
| ▲ | ncruces 2 days ago | parent | prev | next [-] |
| Yeah, using LLVM for anything trying to avoid UB is crazy. I got involved in a discussion with a Rust guy when trying to get C with SIMD intrinsics into wasi-libc, where something the C standard explicitly states is “implementation-defined” (and thus sane, as we're targeting a single implementation, LLVM) can't be trusted, because LLVM may turn it back into UB for “reasons.” At this point Go and Zig made the right choice to dump it. I don't know about Rust. https://github.com/WebAssembly/wasi-libc/pull/593 |
| |
| ▲ | AndyKelley 2 days ago | parent [-] | | It sounds like you have a fundamental misunderstanding about undefined behavior. It's easy to emit LLVM IR that avoids undefined behavior. The language reference makes it quite clear what constitutes undefined behavior and what does not. The issue is that frontends want to emit code that is as optimizable as possible, so they opt into the complexity of specifying additional constraints, attributes, and guarantees, each of which risks triggering undefined behavior if the frontend has a bug and emits wrong information. | | |
| ▲ | ncruces 2 days ago | parent [-] | | Hi Andy. Did you read the linked thread? I was not the one making this claim: > However, I believe that currently, there is no well-defined way to actually achieve this on the LLVM IR level. Using plain loads for this is UB (even if it may usually work out in practice, and I'm sure plenty of C code just does that). My claim is that the snippet below is implementation-defined (not UB):

  // Casting through uintptr_t makes this implementation-defined,
  // rather than undefined behavior.
  uintptr_t align = (uintptr_t)s % sizeof(v128_t);
  const v128_t *v = (v128_t *)((uintptr_t)s - align);

Further, that this is actually defined by the implementation to do the correct thing, by any good faith reading of the standard: > The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment. I further suggested laundering the pointer with something like the below, but was told it would amount to nothing, with the blame again being put on LLVM:

  asm ("" : "+r"(v))
I honestly don't know if LLVM or clang should be to blame. I was told LLVM IR and took it in good faith. | | |
| ▲ | AndyKelley 2 days ago | parent [-] | | No, I hadn't read the linked thread until you prodded me. Now I have and I understand the situation entirely. I'll give a brief overview; feel free to ask any followup questions. A straightforward implementation of memchr, i.e. finding the index of a particular byte inside an array of bytes, looks like this:

  for (bytes, 0..) |byte, i| {
      if (byte == search) return i;
  }
  return null;

This is trivial to lower to well-defined LLVM IR. But it's desirable to use tricks to make the function really fast, such as assuming that you can read up to the page boundary with SIMD instructions[1]. This is generally true on real world hardware, but it is incompatible with the pointer provenance memory model, which is load-bearing for important optimizations that C, C++, Rust, and Zig all rely on. So if you want to do such tricks you have to do them in a black box that is exempt from the memory model rules. The Zig code I link to here is unsound because it does not do this. An optimization pass, whether implemented in the Zig pipeline or the LLVM pipeline, would be able to prove that it reads outside a pointer's provenance, mark that particular control flow unreachable, and thereby cause undefined behavior if it happens. This is not really LLVM's fault. This is a language shortcoming in C, C++, Rust, Zig, and probably many others. It's a fundamental conflict between the utility of pointer provenance rules, and the utility of ignoring that crap and just doing what you know the machine allows you to do.

[1]: https://github.com/ziglang/zig/blob/0.14.1/lib/std/mem.zig#L... | | |
| ▲ | ncruces 2 days ago | parent [-] | | Thanks for taking the time! I was the original contributor of the SIMD code, and got this… pushback. I still don't quite understand how you can marry “pointer provenance” with the intent that converting between pointers and integers be “consistent with the addressing structure of the execution environment”, and want to allow DMA in your language, and yet have this be UB. But well, a workable version of it got submitted, I've made subsequent contributions (memchr, strchr, str[c]spn…), all good. Just makes me salty about C, as if I needed more reasons to be. | | |
| ▲ | AndyKelley a day ago | parent [-] | | That's totally fair to be salty about a legitimately annoying situation. But I think it's actually an interesting, fundamental complexity of computer science, as opposed to some accidental complexity that LLVM is bringing to the table. |
|
|
|
|
|
|
| ▲ | pjmlp 2 days ago | parent | prev [-] |
| Which is why nowadays most frontends have been migrating to MLIR, and there is ongoing work for clang as well. |
| |
| ▲ | AndyKelley 2 days ago | parent [-] | | How does migrating to MLIR address the problem? | | |
| ▲ | pjmlp 2 days ago | parent [-] | | The higher abstraction level it provides over LLVM IR makes language frontends and compiler passes less dependent on LLVM's semantics. | | |
| ▲ | alexrp 2 days ago | parent | next [-] | | As the guy currently handling Zig's LLVM upgrades, I do not see this as an advantage at all. The more IR layers I have to go through to diagnose miscompilations, the more of a miserable experience it becomes. I don't know that I would have the motivation to continue doing the upgrades if I also had to deal with MLIR. | | |
| ▲ | pjmlp 2 days ago | parent [-] | | The LLVM project sees it otherwise, and the adoption across the LLVM community is quite telling of where they stand. | | |
| ▲ | alexrp 2 days ago | parent [-] | | That doesn't seem like a good argument for why Zig ought to target MLIR instead of LLVM IR. I think I'd like to see some real-world examples of compilers for general-purpose programming languages using MLIR (ClangIR is still far from complete) before I entertain this particular argument. | | |
| ▲ | pjmlp 2 days ago | parent [-] | | Would Flang do it? Fortran was once general purpose. https://github.com/llvm/llvm-project/blob/main/flang/docs/Hi... Maybe the work on Swift (SIL), Rust (MIR), and Julia (SSAIR), which partially inspired MLIR, alongside the work done at Google designing the TensorFlow compiler? The main goal being an IR that would accommodate all the use cases of those high-level IRs. Here are the presentation slides from the European LLVM Developers Meeting back in 2019: https://llvm.org/devmtg/2019-04/slides/Keynote-ShpeismanLatt... Also, you can find many general-purpose enough users in this listing: https://mlir.llvm.org/users/ | | |
| ▲ | pklausler 2 days ago | parent [-] | | Are you saying that Fortran was once a general purpose programming language, but somehow changed to no longer be one? | | |
| ▲ | pjmlp 2 days ago | parent [-] | | Yes, because we are no longer in the 1960's - 1980's. C and C++ took over many of the use cases people were using Fortran for during those decades. In 2025, while it is a general purpose language, its use is constrained to scientific computing and HPC. Most wannabe CUDA replacements keep forgetting Fortran is one of the reasons the scientific community ignored OpenCL. | | |
| ▲ | pklausler 2 days ago | parent [-] | | So you're saying that the changes made to Fortran have made it more specialized? |
|
|
|
|
|
| |
| ▲ | AndyKelley 2 days ago | parent | prev [-] | | Huh?? That can only make frontends' jobs more tricky. | | |
| ▲ | pjmlp 2 days ago | parent [-] | | Yet it has been embraced widely since its introduction in 2019, with its own organization and conference talks. So maybe all those universities, companies, and the LLVM project know kind of what they are doing. - https://mlir.llvm.org/ - https://llvm.github.io/clangir/ - https://mlir.llvm.org/talks/ | | |
| ▲ | AndyKelley 2 days ago | parent [-] | | No need to make a weird appeal to authority. Can you just explain the answer to my question in your own words? | | |
| ▲ | marcelroed 2 days ago | parent | next [-] | | I am only familiar with MLIR for accelerator-specific compilation, but my understanding is that by describing operations at a higher level, you don’t need the frontend to know what LLVM IR will lead to the best final performance. For instance you could say "perform tiled matrix multiplication" instead of "multiply and add while looping in this arbitrary indexing pattern", and an MLIR pass can reason about what pattern to use and take whatever hints you’ve given it. This is especially helpful when some implementations should be different depending on previous/next ops and what your target hardware is. I think there’s no reason Zig can’t do something like this internally, but MLIR is an existing way to build primitives at several different levels of abstraction. From what I’ve heard it’s far from ergonomic for compiler devs, though… | |
| ▲ | pjmlp 2 days ago | parent | prev [-] | | You see it as an appeal to authority; I see it as the community of frontend developers, based on the Swift and Rust integration experience and the work done by Chris Lattner while at Google, feeding back into what the evolution of LLVM IR is supposed to look like. Mojo and Flang were designed from scratch using MLIR, as are many other newer languages in the LLVM ecosystem. I see it as the field experience of folks who know a little bit more than I ever will about compiler design. |
|
|
|
|
|
|