ncruces 4 days ago

The current standard still says integer-to-pointer conversions are implementation defined (not undefined) and furthermore "intended to be consistent with the addressing structure of the execution environment" (that's a direct quote).

I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior?

And regarding pointer provenance, why should going through a pointer-to-integer and integer-to-pointer conversions try to preserve provenance at all, and be undefined behavior in situations where that provenance is ambiguous?

The reason I'm using integer (rather than pointer) arithmetic is precisely so I don't have to be bound by pointer arithmetic rules. What good purpose does it serve for this to be undefined (rather than implementation defined) beyond preventing certain programs from being meaningfully written at all?

I'm genuinely curious.

uecker 4 days ago | parent | next [-]

I fully agree with your analysis, but compiler writers did think they could bend the rules, hence it was necessary to clarify that pointer-to-integer casts do work as intended. This is still not in ISO C 23 btw because some compiler vendors did argue against it. But it is a TS now. If you are affected, please file bugs against your compilers.

gpderetta 4 days ago | parent [-]

Do you fully agree? I finally went and read n3005.pdf. The important item there is that a cast to integer exposes the pointer, and the compiler must then be conservative and assume that the pointed-to object might be changed via non-trackable pointers. This seems quite a reasonable compromise to make existing code work without affecting the vast majority of objects whose address is never cast to an integer. But ncruces wants defined semantics for arbitrary forged pointers.
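
To make the "exposes" part concrete, here is a minimal sketch of my reading of the model (illustrative only, the names are mine):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2;
        uintptr_t ia = (uintptr_t)&a;  /* the cast "exposes" a              */
        /* b's address is never cast to an integer, so b stays unexposed    */

        int *p = (int *)ia;            /* may legitimately point to a       */
        *p = 42;                       /* compiler must assume a changed    */

        printf("%d %d\n", a, b);       /* a cannot be assumed to still be 1;
                                          b is unaffected by any of this    */
    }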

uecker 4 days ago | parent [-]

You are right, I wasn't thinking straight. I do not fully agree. Creating arbitrary pointers cannot work. Forging pointers to an implementation-defined memory region would be OK, though.

ncruces 4 days ago | parent [-]

Why can't it work though?

And I'm talking about both things.

Integer arithmetic that produces pointers that are just out of bounds of an object. Why can't this work? Why can't the compiler assume that, since I explicitly converted a pointer to an integer, the pointed-to object can't be put into a register, or made to go out of scope early?

Second, fabricating pointers. If I have a pointer to mmap/sbrk memory, shouldn't I be allowed to “fabricate” arbitrary pointers from integers that point into that area? If not, why not?

Finally Wasm. The linear memory is addressable from address 0 to __builtin_wasm_memory_size * PAGESIZE. Given this, and except maybe the address at zero, why should it be undefined behavior to dereference any other address?
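
Roughly the kind of access I mean, assuming clang's __builtin_wasm_memory_size builtin and the 64 KiB Wasm page size (a sketch, not portable C; the function name is mine):

    #include <stddef.h>
    #include <stdint.h>

    #define WASM_PAGE_SIZE 65536u

    uint64_t peek(uintptr_t addr) {
        size_t limit = __builtin_wasm_memory_size(0) * WASM_PAGE_SIZE;
        if (addr == 0 || addr + sizeof(uint64_t) > limit)
            return 0;                      /* outside linear memory         */
        return *(const uint64_t *)addr;    /* the load I want to be defined */
    }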

What's the actual advantage to making these undefined behavior? What do we gain in return?

gpderetta 4 days ago | parent | next [-]

In practice, if you do a volatile read at an arbitrary mapped address it will work. But you have no guarantee regarding what you will read from it, even if it happens to match the address of a variable you just wrote into.

Formally it is undefined, as there is no way to give it sane semantics and it will definitely trip sanitizers and similar memory safety tools.

ncruces 4 days ago | parent [-]

So how can you implement an allocator in the language itself? Not even talking about malloc (which gets to have language-blessed semantics); say, an arena allocator.

You get a bunch of memory from mmap. There are no “objects” (in C terminology) in there, and a single “provenance” (if at all, you got it from a syscall, which is not part of the language).

If arbitrary integer/pointer math inside that buffer is not OK, how do you get heterogeneous allocations from the arena to work? When do slices of it become “objects” and gain any other “provenance”?

Is C supposed to be the language you can't write an arena allocator in (or a conservative GC, or…)?
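
To make the question concrete, this is roughly the arena I have in mind (a sketch; the identifiers are mine):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    typedef struct { char *base; size_t used, cap; } arena;

    arena arena_new(size_t cap) {
        void *p = mmap(NULL, cap, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (arena){ .base = p == MAP_FAILED ? NULL : p, .cap = cap };
    }

    void *arena_alloc(arena *a, size_t size, size_t align) {
        if (!a->base)
            return NULL;
        uintptr_t base = (uintptr_t)a->base;                 /* pointer -> integer */
        uintptr_t next = (base + a->used + align - 1) & ~(uintptr_t)(align - 1);
        if (next + size > base + a->cap)
            return NULL;                                     /* arena exhausted    */
        a->used = (size_t)(next + size - base);
        return (void *)next;                                 /* integer -> pointer */
    }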

gpderetta 3 days ago | parent | next [-]

Handling allocators correctly is actually quite problematic. In C++ you would placement-new into the raw storage, which ends the lifetime of whatever was there and starts the lifetime of a new object, and as long as you use the pointer returned by the placement new expression (or use std::launder), formally you are OK.

Famously, you cannot implement an allocator in C on top of static named storage; I understand that on anonymous memory (like that returned by sbrk or mmap, or an upstream allocator) it should work fine, but, as a C++ programmer, I'm not familiar with the specific details of the C lifetime model that allow it. I understand that stores into anonymous memory can change the dynamic type (or whatever the C equivalent is) of the object.

In any case, the issue is around object lifetimes and aliasing rather than pointer provenance: you can treat the anonymous memory as just a char array, you can safely form pointers into it, and they will carry the correct provenance information.
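
My (possibly imperfect) understanding of the C side, sketched: storage with no declared type takes the effective type of whatever is stored into it, and a fresh store can change it.

    #include <stdlib.h>

    void reuse(void) {
        void *raw = malloc(sizeof(double));  /* allocated storage, no declared type */
        if (!raw) return;

        int *ip = raw;
        *ip = 42;            /* the effective type of the storage is now int        */

        double *dp = raw;
        *dp = 3.14;          /* a later store can change the effective type again   */

        free(raw);
    }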

pjmlp 3 days ago | parent | prev [-]

You don't; that is why people who never bothered with ISO C legalese have this idea of several tricks being C, when they are in practice "Their C Compiler" language.

Several things in C, if you want to stay within the guarantees of ISO C for portable code, have to be written in straight Assembly, not even inline, as the standard only says that an asm keyword may exist, leaving all other details to the implementation.

uecker 3 days ago | parent | prev [-]

Integer arithmetic working correctly, converting a pointer to an integer, and converting the integer back to the same object is something which should work. This is what we made sure to guarantee in the provenance TS. Making this work for memory outside of any object, or converting back to a different object whose address was not converted to an integer previously, is difficult. It can be made to work, but then you need to give up a lot (and rewrite your compilers). Outside of an object is clear, because there might be no memory mapped there. Converting back to an arbitrary object does not work because the compiler must know somehow that it cannot put the object into a register. If you allow conversion back to arbitrary objects, it cannot put anything into registers. This would be bad.
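
A small sketch of the register point (illustrative only, nothing here is normative):

    #include <stdint.h>

    int f(uintptr_t guess) {
        int local = 0;              /* address never taken, never exposed       */
        int *p = (int *)guess;      /* forged from an integer                   */
        *p = 7;                     /* if this were allowed to alias 'local',
                                       no variable could ever live in a register */
        return local;               /* provenance lets this fold to 0           */
    }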

Fabricating pointers in an implementation-defined way is obviously OK. This would cover mmap / sbrk or mapped I/O. Note also that historically, the C standard left things UB exactly so that implementations can use it for extensions (e.g. mmap). The idea that UB == invalid program is fairly recent misinformation, but we have to react to it and make things at least implementation-defined (which also meant something a bit different before).

ncruces 3 days ago | parent [-]

> Note also that historically, the C standard left things UB exactly so that implementations can use it for extensions (e.g. mmap). The idea that UB == invalid program is fairly recent misinformation, but we have to react to it and make things at least implementation-defined (which also meant something a bit different before).

I'll just finish by saying: yes please. And thank you for bearing with me.

SkiFire13 4 days ago | parent | prev | next [-]

> I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior?

How would you define it? Especially in a way that is consistent with the rest of the language and allows common optimizations (remember that C supports variables, which may or may not be stored in memory)?

ncruces 4 days ago | parent [-]

Just read whatever is at address 12345 of the linear memory. Doesn't matter what that is. If it's an object, if it was malloc'ed, if it's the "C stack", a "global".

It's the only way to interpret *(uint64_t*)(12345) when the standard says that an integer-to-pointer conversion is "intended to be consistent with the addressing structure of the execution environment".

There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, and the standard says the implementation should define this to be consistent with the addressing structure of the execution environment; so why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"?

vlovich123 4 days ago | parent [-]

If a newer version of that value is also stored in a register and not yet flushed to memory, should the compiler know to insert that flush for you, or is reading a stale value OK?

For what it’s worth, there’s a reason you’re supposed to do this kind of access through memcpy, not by dereferencing made-up pointers.
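
The memcpy version of such an access looks roughly like this (the helper name is mine); it sidesteps the type-punning question, though not the provenance one:

    #include <stdint.h>
    #include <string.h>

    uint64_t read_u64(const void *src) {
        uint64_t v;
        memcpy(&v, src, sizeof v);   /* bytewise copy instead of a typed load */
        return v;
    }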

> There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, and the standard says the implementation should define this to be consistent with the addressing structure of the execution environment; so why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"?

Because the language standard is defined in terms of a virtual machine, not any given implementation. That virtual machine is then implemented on various platforms, but the capabilities of the underlying system aren’t directly accessible; they are only there to implement the C virtual machine. That’s why C can target so many different machines.

ncruces 4 days ago | parent [-]

> If a newer version of that value is also stored in a register and not yet flushed to memory, should the compiler know to insert that flush for your or is reading a stale value ok?

Any value would be OK. There are aliasing rules to follow, and it's OK if those crater performance when you start using integer-to-pointer conversions a lot. Is that a problem? But in this instance, assume I don't even care.

> For what it’s worth there’s a reason you’re supposed to do this kind of access through memcpy, not by dereferencing made up pointers.

Then why allow integers to be converted to pointers at all, say it's implementation defined, and meant to represent the addressing structure of the environment?

> Because the language standard is defined to target a virtual machine as output, not any given implementation …

Again, I'm not talking about this being portable. The standard says it's implementation defined, and meant to match the addressing structure of the platform. I offered a specific platform where all of this has a specific meaning, that's all.

What's the point of specifying this, if you're then going to say _actually_ because of aliasing it's “undefined” and as soon as that magic word appears, a smart compiler that can prove it at compile time decides this code can't possibly be reached, and deletes the entire function?

What good does this bring us if it means clang can't be used to target platforms where direct memory access is a thing?

vlovich123 3 days ago | parent [-]

> What good does this bring us if it means clang can't be used to target platforms where direct memory access is a thing?

I don’t know about this specific instance, but generally disallowing aliasing enables huge performance gains. That’s something Rust leans on heavily, and one of the reasons real-world code bases are faster in Rust than in C++ (the other being a much better standard library with better defaults for containers and whatnot).
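
For example (a sketch, with illustrative names): with restrict-qualified parameters the compiler may load *scale once and keep it in a register, because the stores through dst are guaranteed not to alias it.

    void scale_all(float *restrict dst, const float *restrict src,
                   const float *restrict scale, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * *scale;   /* *scale need not be reloaded each iteration */
    }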

bcrl 4 days ago | parent | prev | next [-]

It is important to understand why undefined behaviour has proliferated over the past ~25 years. Compiler developers are (like the rest of us) under pressure to improve metrics like the performance of compiled code. Often enough that's because a CPU vendor is the one paying for the work and has a particular target they need to reach at time of product launch, or there's a new optimization being implemented that has to be justified as showing a benefit on existing code.

The performance of compilers is frequently measured using the SPEC series of CPU benchmarks, and one of the main constraints of the SPEC series of tests is that the source code of the benchmark cannot be changed. It is static.

As a result, compiler authors have to find increasingly convoluted ways to make it possible for various new compiler optimizations to be applied to the legacy code used in SPEC. Take 403.gcc: it's based on gcc version 3.2 which was released on August 14th 2002 -- nearly 23 years ago.

By making certain code patterns undefined behaviour, compiler developers are able to relax the constraints and allow various optimizations to be applied to legacy code in places which would not otherwise be possible. I believe the gcc optimization to eliminate NULL pointer checks when the pointer is dereferenced was motivated by such a scenario.
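
The pattern looks roughly like this (identifiers mine): because dereferencing a null pointer is undefined, the compiler may assume the pointer is non-null after the load and delete the later check.

    int first(int *p) {
        int v = *p;        /* after this, p is assumed to be non-null       */
        if (p == NULL)     /* provably dead under that assumption, so the
                              branch can be removed                          */
            return -1;
        return v;
    }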

In the real world, code tends to get updated when compilers are updated, or when performance optimizations are made, so there is no need for excessive compiler "heroics" that weasel their way into making optimizations apply via undefined behaviour. So long as SPEC is used to measure compiler performance using static and unchanging legacy code, we will continue to see compiler developers committing undefined behaviour madness.

The only way around this is for non-compiler-developer folks to force language standards to prevent compilers from using undefined behaviour to do what normal software developers consider to be utterly insane code transformations.

uecker 4 days ago | parent | next [-]

Language standards have much less power than people think and compiler-vendors are of course present in the standard working groups. Ultimately, the users need to put pressure on the compiler vendors. Please file bugs - even if this often has no effect, it takes away the argument "this is what our users want". Also please support compilers based on how they deal with UB and not on the latest benchmark posted somewhere.

bcrl 4 days ago | parent [-]

Language standards have plenty of power over compiler vendors, however, very few people that are not involved in writing compilers tend to participate in the standards process. Standards bodies bend to the will of those participating.

pjmlp 3 days ago | parent [-]

Not really, there are several features that can be used as example of compiler vendors ignoring standard features that they deemed inconvenient to implement, or undesirable for their customers.

https://en.cppreference.com/w/cpp/compiler_support.html

https://en.cppreference.com/w/c/compiler_support.html

pjmlp 4 days ago | parent | prev [-]

Dr. Dobbs used to have articles with those benchmarks, here are a couple of examples,

https://dl.acm.org/doi/10.5555/11616.11617

https://jacobfilipp.com/DrDobbs/articles/DDJ/1991/9108/9108h...

jcranmer 4 days ago | parent | prev [-]

In a compiler, you essentially need the ability to trace all the uses of an address, at least in the easy cases. Converting a pointer to an integer (or vice versa) isn't really a deal-breaker; it's essentially the same thing as passing (or receiving) a pointer to an unknown external function: the pointer escapes, whelp, nothing more we can do in that case for the most part.

But converting an integer to a pointer creates a problem if you allow that pointer to point to anything--it breaks all of the optimizations that assumed they could trace all of the uses of an address. So you need something like provenance to say that certain back-conversions are illegal. The most permissive model is a no-address-taken model (you can't forge a pointer to a variable whose address was never taken). But most compilers opt instead for a data-dependency-based model: essentially, even integer-based arithmetic on addresses isn't allowed to produce an out-of-bounds access at the point of dereference. Or at least, they claim to--the documentation for both gcc and llvm makes this claim, but both have miscompilation bugs because they don't actually implement it faithfully.

The proposal for pointer provenance in C essentially looks at how compilers generally implement things and suggests a model that's closer to their actual implementation: pointer-to-integer exposes the address such that any integer-to-pointer can point to it. Note this is more permissive than the claimed models of compilers today--you're explicitly able to violate out-of-bounds rules here, so long as both objects have had their addresses exposed. There's some resistance to this because adhering to this model also breaks other optimizations (for example, (void*)(uintptr_t)x is not the same as x).
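
The textbook example here (as I understand it, and simplified) looks something like this:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int x[1] = {0}, y[1] = {0};
        uintptr_t ix = (uintptr_t)(x + 1);  /* exposes x, one past its end   */
        uintptr_t iy = (uintptr_t)y;        /* exposes y                     */

        if (ix == iy) {                     /* x and y happen to be adjacent */
            int *p = (int *)ix;
            *p = 10;        /* allowed to store to y under the exposure model;
                               a compiler tracking only data dependencies may
                               fold p back to x+1 and miscompile this         */
            printf("%d\n", y[0]);
        }
    }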

As a practical matter, pointer provenance isn't that big of a deal. It's not hard to come up with examples that illustrate behaviors that cause miscompilation or are undefined specifically because of pointer provenance. But I'm not aware of any application code that was actually miscompiled because the compiler implemented its provenance model incorrectly. The issue gets trickier as you move into systems code that exists somewhat outside the C object model, but even then, most of the relevant code can ignore the fact that it lives outside the object model, since the resulting miscompiles are prevented by inherent optimization barriers anyway (note that to get a miscompile, you generally have to simultaneously forge the object's address, have the object's address be known to the compiler already, and have the compiler think the object's address wasn't exposed by other means).