| |
| ▲ | taneq 4 days ago | parent | next [-] | | > practical C programmers don't read the document and since this UB was a "ghost" they weren't tripped by it I would strongly suspect that C compiler implementers very much do read the document, though. Which, as far as I can see, means "ghosts" could easily become actual UB (and worse, sneaky UB that you wouldn't expect.) | | |
| ▲ | tialaramex 4 days ago | parent | next [-] | | The previous language might cause a C compiler developer to get very confused, because it seems as though they can choose something else but what that something is isn't specified; almost invariably they'll eventually realise: oh, it's just badly worded and didn't mean "should" there. It's like one of those tricky self-referential parlor box statements. "The statement on this box is not true"? Thanks I guess. But that's a game, the puzzles are supposed to be like that, whereas the mission of the ISO document was not to confuse people, so it's good that it is being improved. | | |
| ▲ | uecker 4 days ago | parent [-] | | Most of the "ghosts" are indeed just cleaning up the wording. But compiler writers historically often used any excuse that the standard is not clear to justify aggressive optimization. This ranges from an overreaching interpretation of UB itself to wacky concepts such as time-travel, wobbly numbers, incorrect implementation of aliasing (e.g. still in clang), and pointer-to-integer round trips. | | |
| ▲ | tialaramex 4 days ago | parent [-] | | I'm sure the compiler authors will disagree that they were "using any excuse". From their point of view they were merely making transformations between equivalent programs, and so any mistake is either that these are not in fact equivalent programs because they screwed up - which is certainly sometimes the case - or the standard should not have said they were equivalent but it did. One huge thing they have on their side is that their implementation is concrete. Whatever it is that, say, GCC does is de facto actually a thing a compiler can do. The standards bodies (and WG21 has been worse by some margin, but they're both guilty) may standardize anything, but concretely the compiler can only implement some things. "Just do X" where X isn't practical works fine on paper but is not implementable. This was the fate of the Consume ordering. Consume/Release works fine on paper, you "just" need to have whole-program analysis to implement it. Well of course that's not practical so it's not implemented. | | |
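For readers who haven't met it, here is a minimal sketch of the consume/release pattern being described, using C11 atomics; the names (publish, reader, struct node) are made up for illustration. On paper only loads data-dependent on the consumed pointer need ordering, but in practice compilers simply promote memory_order_consume to the stronger memory_order_acquire rather than attempt the dependency tracking:

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { int payload; };

    static _Atomic(struct node *) shared;

    void publish(struct node *n) {
        n->payload = 42;
        /* release: everything written above is visible to whoever acquires/consumes */
        atomic_store_explicit(&shared, n, memory_order_release);
    }

    int reader(void) {
        /* consume: on paper, only reads data-dependent on n are ordered after this load */
        struct node *n = atomic_load_explicit(&shared, memory_order_consume);
        return n ? n->payload : -1;
    }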
| ▲ | uecker 4 days ago | parent [-] | | They sometimes screwed up, sometimes just because of bugs, or because different optimization passes had different assumptions that are inconsistent. This somewhat contradicts your second point: compilers sometimes have things implemented which may be concrete in some sense (because it is in a compiler), but which are still not really a "thing" because they are a mess nobody can formalize using a coherent set of rules. But then, they also sometimes misread the standard in ways I can't really understand. This often can be seen when the "interpretation" changes over time. Earlier compilers (or even earlier parts of the same compiler) implement the standard as written, while some new optimization pass has some creative interpretation. | | |
| ▲ | tialaramex 4 days ago | parent [-] | | Certainly compiler developers are only human, and many of them write C++ so they're humans working with a terrible programming language, I wouldn't sign up for that either (I have written small contributions to compilers, but not in C++). I still don't see "any excuses". I see more usual human laziness and incompetence, LLVM for example IMNSHO doesn't work hard enough to ensure their IR has coherent semantics and to deliver on those semantics. The compiler bug I'm most closely following, and which I suspect you have your eye on too is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119472 aka https://github.com/rust-lang/rust/issues/107975 https://github.com/llvm/llvm-project/issues/45725 But it seems like it's just that everybody fucked this up in similar ways, that's two different major compiler backends! I wouldn't be surprised if Microsoft (whose code we can't see) find that they don't get this quite right either. | | |
| ▲ | uecker 4 days ago | parent [-] | | I do not imply bad intentions, but I see the arguments brought forward in WG14. It got better in recent years but we had to push back against some rather absurd interpretations of the standard, e.g. that unspecified values can change after they are selected. Your example shows something else: the standard is simply not seen as very important. The standard is perfectly clear about how pointer comparison works, and yet this alone is not reason enough to invest resources into fixing it if it is not shown to cause actual problems in real code. | | |
| ▲ | tialaramex 3 days ago | parent [-] | | > [... absurd interpretations] unspecified values can change after they are selected It seems hard to avoid this without imputing a freeze semantic, which would be expensive on today's systems. Maybe I don't understand what you mean? Rust considered insisting on the freeze semantic and was brought up short by the cost, as I understand it. | | |
| ▲ | uecker 3 days ago | parent [-] | | I do not see how this adds a substantial cost. It is required for C programs to work correctly, and the C standard carefully describes the exact situations where unspecified values are chosen - so the idea that the compiler is then free to break this is clearly in contradiction to the wording. Clang got this wrong and has, I assume, mostly fixed it, because non-frozen values caused a lot of inconsistency and other bugs. | | |
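A minimal sketch of the disagreement, assuming the reading defended above (an unspecified value is chosen once, when it is read, and then stays stable):

    #include <stdio.h>

    int main(void) {
        int x;          /* never written: its value is unspecified */
        int *p = &x;    /* address taken: reading x then yields an unspecified value, not UB */
        int a = *p;
        int b = *p;
        /* With stable ("frozen") unspecified values, a == b must hold.
           Under the "wobbly value" reading, each read may observe something
           different, so this could print 0 - the behaviour being objected to. */
        printf("%d\n", a == b);
        return 0;
    }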
| ▲ | tialaramex 3 days ago | parent [-] | | Like I said, maybe I'm not understanding which "unspecified values" we're talking about. The freeze semantic is a problem when we've said only that we don't know what value is present (typically one or more mapped but unwritten bytes) and so since we never wrote to this RAM the underpinning machine feels confident to just change what is there. Which means saying "No" isn't on the compiler per se. The OS and machine (if virtual) might be changing this anyway. If you know of Facebook's magic zero string terminator bug, that's the sort of weird symptom you get. But maybe you're talking about something else entirely? | | |
| ▲ | uecker 3 days ago | parent [-] | | No, but jemalloc uses a kernel API that has this behavior, and IMHO it is then non-conforming (when using this API, which I think is configurable). The Facebook bug should be taken as a clear sign that this behavior is a terrible idea and not something to be blessed by modifying the standard. When the original kernel API was introduced, it was already pointed out that the behavior is not ideal. There is no fundamental reason (including performance reasons) this has to behave in this way. It is just bad engineering. | | |
| ▲ | tialaramex 3 days ago | parent [-] | | But far from "The compiler shouldn't allow this" what we're talking about here is platform behaviour. My impression is that virtual machines often just do this, so it may be that even your OS has no idea either. | | |
| ▲ | uecker 3 days ago | parent [-] | | Virtual machines do not change memory behind your back without your permission. The issue with jemalloc is a very specific problem with a specific Linux API, i.e. MADV_FREE, which has the problematic behavior that it reallocates pages when they are written to, but not already when they are accessed. When using this API, jemalloc is not a conforming implementation of malloc. We cannot weaken the language semantics every time someone implements something broken. Why MADV_FREE behaves like this is unclear to me; it was criticized the moment it was introduced into the kernel. But the main problem is using it for a memory allocator in C. |
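A sketch of the Linux API being discussed (the function name retire is made up for illustration); madvise with MADV_FREE is what lets the kernel reclaim pages lazily:

    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* An allocator returns a block to the OS lazily. */
    void retire(void *block, size_t len) {
        madvise(block, len, MADV_FREE);
        /* From here on the kernel may reclaim these pages at any time:
           a later read can still return the old contents or already return
           zeros, and only a write cancels the pending reclaim. If the
           allocator hands the block out again without writing to it first,
           a program can observe memory "change" after it was read - the
           behaviour criticized above. */
    }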
| |
| ▲ | Sharlin 4 days ago | parent | prev [-] | | If I understand correctly, the "ghosts" are vacuously UB. As in, the standard specifies that if X, then UB, but X can in fact never be true according to the standard. |
| |
| ▲ | uecker 4 days ago | parent | prev [-] | | Fixing the actual problems is work-in-progress (as my document also indicates), but naturally it is harder. But the original article also complains about the number of trivial UB. |
|
| |
| ▲ | gpderetta 4 days ago | parent [-] | | Pointer provenance already existed before, but the standards were contradictory and incomplete. This is an effort to more rigorously nail down the semantics. I.e., the UB already existed, but it was not explicit, it had to be inferred from the whole text, and the boundaries were fuzzy. Remember that anything not explicitly defined by the standard is implicitly undefined. Also remember: just because you can legally construct a pointer doesn't mean it is safe to dereference. | | |
| ▲ | ncruces 4 days ago | parent | next [-] | | The current standard still says integer-to-pointer conversions are implementation defined (not undefined) and furthermore "intended to be consistent with the addressing structure of the execution environment" (that's a direct quote). I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior? And regarding pointer provenance, why should going through pointer-to-integer and integer-to-pointer conversions try to preserve provenance at all, and be undefined behavior in situations where that provenance is ambiguous? The reason I'm using integer (rather than pointer) arithmetic is precisely so I don't have to be bound by pointer arithmetic rules. What good purpose does it serve for this to be undefined (rather than implementation defined) beyond preventing certain programs from being meaningfully written at all? I'm genuinely curious. | | |
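A minimal sketch of the access being described, assuming clang's wasm32 target; peek is a made-up name, and __builtin_wasm_memory_size(0) is assumed to report the size of linear memory 0 in 64 KiB pages:

    #include <stdint.h>

    uint64_t peek(uintptr_t addr) {
        uintptr_t limit = (uintptr_t)__builtin_wasm_memory_size(0) * 65536;
        if (addr > limit - sizeof(uint64_t))
            return 0;                        /* outside linear memory */
        return *(const uint64_t *)addr;      /* the integer-to-pointer load in question */
    }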
| ▲ | uecker 4 days ago | parent | next [-] | | I fully agree with your analysis, but compiler writers did think they could bend the rules, hence it was necessary to clarify that pointer-to-integer casts do work as intended. This is still not in ISO C23, btw, because some compiler vendors argued against it. But it is a TS now. If you are affected, please file bugs against your compilers. | | |
| ▲ | gpderetta 4 days ago | parent [-] | | Do you fully agree? I finally went and read n3005.pdf. The important item there is that a cast to integer exposes the pointer, and now the compiler must be conservative and assume that the pointed-to object might be changed via non-trackable pointers. This seems quite a reasonable compromise to make existing code work without affecting the vast majority of objects whose address is never cast to an integer. But ncruces wants defined semantics for arbitrary forged pointers. | | |
| ▲ | uecker 4 days ago | parent [-] | | You are right, I wasn't thinking straight. I do not fully agree. Creating arbitrary pointers cannot work. Forging pointers to an implementation-defined memory region would be ok though. | | |
| ▲ | ncruces 4 days ago | parent [-] | | Why can't it work though? And I'm talking about both things. Integer arithmetic that produces pointers that are just out of bounds of an object. Why can't this work? Why can't the compiler assume that, since I explicitly converted a pointer to an integer, the pointed-to object can't be put into a register, or made to go out of scope early? Second, fabricating pointers. If I have a pointer to mmap/sbrk memory, shouldn't I be allowed to “fabricate” arbitrary pointers from integers that point into that area? If not, why not? Finally Wasm. The linear memory is addressable from address 0 to __builtin_wasm_memory_size * PAGESIZE. Given this, and except maybe the address at zero, why should it be undefined behavior to dereference any other address? What's the actual advantage of making these undefined behavior? What do we gain in return? | | |
| ▲ | gpderetta 3 days ago | parent | next [-] | | In practice if you do a volatile read at an arbitrary mapped address it will work. But you have no guarantee regarding what you will read from it, even if it happens to match the address of a variable you just wrote into. Formally it is undefined, as there is no way to give it sane semantics and it will definitely trip sanitizers and similar memory safety tools. | | |
| ▲ | ncruces 3 days ago | parent [-] | | So how can you implement an allocator in the language itself? Not even talking about malloc (which gets to have language-blessed semantics); say an arena allocator, as sketched below. You get a bunch of memory from mmap. There are no “objects” (in C terminology) in there, and a single “provenance” (if at all, you got it from a syscall, which is not part of the language). If arbitrary integer/pointer math inside that buffer is not OK, how do you get heterogeneous allocations from the arena to work? When do slices of it become “objects” and gain any other “provenance”? Is C supposed to be the language you can't write an arena allocator in (or a conservative GC, or…)? | | |
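A minimal bump-arena sketch of the pattern being asked about, assuming POSIX mmap with MAP_ANONYMOUS (names are made up); whether each block handed out below becomes an "object" with its own provenance is exactly the open question:

    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    static char  *arena;
    static size_t arena_cap, arena_off;

    int arena_init(size_t cap) {
        arena = mmap(NULL, cap, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) return -1;
        arena_cap = cap;
        arena_off = 0;
        return 0;
    }

    /* align must be a power of two */
    void *arena_alloc(size_t n, size_t align) {
        /* integer arithmetic on the address, then conversion back to a pointer */
        uintptr_t base = (uintptr_t)arena;
        uintptr_t p    = base + arena_off;
        uintptr_t a    = (p + align - 1) & ~(uintptr_t)(align - 1);
        if (a + n > base + arena_cap) return NULL;
        arena_off = (a + n) - base;
        return (void *)a;
    }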
| ▲ | gpderetta 3 days ago | parent | next [-] | | Handling allocators correctly is actually quite problematic. In C++ you would placement-new into the raw storage, which ends the lifetime of whatever was there and starts the lifetime of a new object, and as long as you use the pointer returned by operator new (or use std::launder), formally you are ok. Famously you cannot implement an allocator in C into static named storage; I understand that on anonymous memory (like that returned by sbrk or mmap, or an upstream allocator) it should work fine, but, as a C++ programmer, I'm not familiar with the specific details of the C lifetime model that allow it. I understand that stores into anonymous memory can change the dynamic type (or whatever is the C equivalent) of the object. In any case the issue is around object lifetimes and aliasing instead of pointer provenance: you can treat the anonymous memory as just a char array, you can safely form pointers into it, and they will carry the correct provenance information. | |
| ▲ | pjmlp 3 days ago | parent | prev [-] | | You don't; that is why people who never bothered with ISO C legalese have this idea of several tricks being C, when they are in practice "Their C Compiler" language. Several things in C, if you want to stay within the guarantees of ISO C for portable code, have to be written in straight Assembly, not even inline, as the standard only defines that an asm keyword must exist, leaving all other details to the implementation. |
|
| |
| ▲ | uecker 3 days ago | parent | prev [-] | | Integer arithmetic working correctly, converting a pointer to an integer, and converting an integer back to the same object is something which should work. This is what we made sure to guarantee in the provenance TS. Making this work for memory outside of any object or converting back to a different object whose address was not converted to an integer previously is difficult. It can be made to work, but then you need to give up a lot (and rewrite your compilers). The outside-of-an-object case is clear, because there might be no memory mapped there. Converting back to an arbitrary object does not work because the compiler must know somehow that it can not put the object into a register. If you allow conversion back to arbitrary objects, it can not put anything into registers. This would be bad. Fabricating pointers in an implementation-defined way is obviously ok. This would cover mmap / sbrk or mapped I/O. Note also that historically, the C standard left things UB exactly so that implementations can use it for extensions (e.g. mmap). The idea that UB == invalid program is fairly recent misinformation, but we have to react to it and make things at least implementation-defined (which also meant something a bit different before). | | |
| ▲ | ncruces 3 days ago | parent [-] | | > Note also that historically, the C standard left things UB exactly so that implementations can use it for extensions (e.g. mmap). The idea that UB == invalid program is fairly recent misinformation, but we have to react to it and make things at least implementation-defined (which also meant a bit something else before). I'll just finish by saying: yes please. And thank you for bearing with me. |
| |
| ▲ | SkiFire13 4 days ago | parent | prev | next [-] | | > I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior? How would you define it? Especially in a way that is consistent with the rest of the language and allows common optimizations (remember that C supports variables, which may or may not be stored in memory)? | | |
| ▲ | ncruces 4 days ago | parent [-] | | Just read whatever is at address 12345 of the linear memory. Doesn't matter what that is. If it's an object, if it was malloc'ed, if it's the "C stack", a "global". It's the only way to interpret *(uint64_t*)(12345) when the standard says that an integer-to-pointer conversion is "intended to be consistent with the addressing structure of the execution environment". There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, the standard says the implementation should define this to be consistent with the addressing structure of the execution environment, why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"? | | |
| ▲ | vlovich123 4 days ago | parent [-] | | If a newer version of that value is also stored in a register and not yet flushed to memory, should the compiler know to insert that flush for you, or is reading a stale value ok? For what it’s worth there’s a reason you’re supposed to do this kind of access through memcpy, not by dereferencing made-up pointers. > There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, the standard says the implementation should define this to be consistent with the addressing structure of the execution environment, why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"? Because the language standard is defined to target a virtual machine as output, not any given implementation. That virtual machine is then implemented on various platforms, but the capabilities of the underlying system aren’t directly accessible - they are only there to implement the C virtual machine. That’s why C can target so many different target machines. | | |
| ▲ | ncruces 3 days ago | parent [-] | | > If a newer version of that value is also stored in a register and not yet flushed to memory, should the compiler know to insert that flush for you, or is reading a stale value ok? Any value would be OK. There are aliasing rules to follow, and it's OK if those crater performance when you start using integer-to-pointer conversions a lot. Is that a problem? But in this instance, assume I don't even care. > For what it’s worth there’s a reason you’re supposed to do this kind of access through memcpy, not by dereferencing made-up pointers. Then why allow integers to be converted to pointers at all, say it's implementation defined, and meant to represent the addressing structure of the environment? > Because the language standard is defined to target a virtual machine as output, not any given implementation … Again, not talking about this being portable. The standard says it's implementation defined, and meant to match the addressing structure of the platform. I offered a specific platform where all of this has a specific meaning, that's all. What's the point of specifying this, if you're then going to say _actually_ because of aliasing it's “undefined” and as soon as that magic word appears, a smart compiler that can prove it at compile time decides this code can't possibly be reached, and deletes the entire function? What good does this bring us if it means clang can't be used to target platforms where direct memory access is a thing? | | |
| ▲ | vlovich123 3 days ago | parent [-] | | > What good does this bring us if it means clang can't be used to target platforms where direct memory access is a thing? I don’t know about this specific instance, but generally disallowing aliasing enables huge performance gains. That’s something Rust does to a great amount and one of the reasons real world code bases are faster in C than C++ (the other being a much better standard library with better defaults for containers and whatnot) |
| |
| ▲ | bcrl 4 days ago | parent | prev | next [-] | | It is important to understand why undefined behaviour has proliferated over the past ~25 years. Compiler developers are (like the rest of us) under pressure to improve metrics like the performance of compiled code. Often enough that's because a CPU vendor is the one paying for the work and has a particular target they need to reach at time of product launch, or there's a new optimization being implemented that has to be justified as showing a benefit on existing code. The performance of compilers is frequently measured using the SPEC series of CPU benchmarks, and one of the main constraints of the SPEC series of tests is that the source code of the benchmark cannot be changed. It is static. As a result, compiler authors have to find increasingly convoluted ways to make it possible for various new compiler optimizations to be applied to the legacy code used in SPEC. Take 403.gcc: it's based on gcc version 3.2 which was released on August 14th 2002 -- nearly 23 years ago. By making certain code patterns undefined behaviour, compiler developers are able to relax the constraints and allow various optimizations to be applied to legacy code in places which would not otherwise be possible. I believe the gcc optimization to eliminate NULL pointer checks when the pointer is dereferenced was motivated by such a scenario. In the real world, code tends to get updated when compilers are updated, or when performance optimizations are made, so there is no need for excessive compiler "heroics" to weasel their way into making optimizations apply via undefined behaviour. So long as SPEC is used to measure compiler performance using static and unchanging legacy code, we will continue to see compiler developers committing undefined behaviour madness. The only way around this is for non-compiler-developer folks to force language standards to prevent compilers from using undefined behaviour to do that which normal software developers consider to be utterly insane code transformations. | |
| ▲ | uecker 4 days ago | parent | next [-] | | Language standards have much less power than people think and compiler-vendors are of course present in the standard working groups. Ultimately, the users need to put pressure on the compiler vendors. Please file bugs - even if this often has no effect, it takes away the argument "this is what our users want". Also please support compilers based on how they deal with UB and not on the latest benchmark posted somewhere. | | |
| ▲ | bcrl 4 days ago | parent [-] | | Language standards have plenty of power over compiler vendors, however, very few people that are not involved in writing compilers tend to participate in the standards process. Standards bodies bend to the will of those participating. | | |
| |
| ▲ | pjmlp 4 days ago | parent | prev [-] | | Dr. Dobbs used to have articles with those benchmarks, here are a couple of examples, https://dl.acm.org/doi/10.5555/11616.11617 https://jacobfilipp.com/DrDobbs/articles/DDJ/1991/9108/9108h... |
| |
| ▲ | jcranmer 4 days ago | parent | prev [-] | | In a compiler, you essentially need the ability to trace all the uses of an address, at least in the easy cases. Converting a pointer to an integer (or vice versa) isn't really a deal-breaker; it's essentially the same thing as passing (or receiving) a pointer to an unknown external function: the pointer escapes, whelp, nothing more we can do in that case for the most part. But converting an integer to a pointer creates a problem if you allow that pointer to point to anything--it breaks all of the optimizations that assumed they could trace all of the uses of an address. So you need something like provenance to say that certain back-conversions are illegal. The most permissive model is a no-address-taken model (you can't forge a pointer to a variable whose address was never taken). But most compilers opt instead for a data-dependency-based model: essentially, even integer-based arithmetic on addresses isn't allowed to violate out-of-bounds rules at the point of dereference. Or at least, they claim to--the documentation for both gcc and llvm makes this claim, but both have miscompilation bugs because they don't actually allow this. The proposal for pointer provenance in C essentially looks at how compilers generally implement things and suggests a model that's closer to their actual implementation: pointer-to-integer exposes the address such that any integer-to-pointer can point to it. Note this is more permissive than the claimed models of compilers today--you're explicitly able to violate out-of-bounds rules here, so long as both objects have had their addresses exposed. There's some resistance to this because adhering to this model also breaks other optimizations (for example, (void*)(uintptr_t)x is not the same as x). As a practical matter, pointer provenance isn't that big of a deal. It's not hard to come up with examples that illustrate behaviors that cause miscompilation or are undefined specifically because of pointer provenance. But I'm not aware of any application code that was actually miscompiled because the compiler implemented its provenance model incorrectly. The issue gets trickier as you move into systems code that exists somewhat outside the C object model, but even then, most of the relevant code can ignore its living outside the object model, since the resulting miscompiles are prevented by inherent optimization barriers anyways (note that to get a miscompile, you generally have to simultaneously forge the object's address, have the object's address be known to the compiler already, and have the compiler think the object's address wasn't exposed by other means). |
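A small sketch of the "exposure" idea described above (the exposed-address style of model in the provenance TS); the function names are made up for illustration:

    #include <stdint.h>

    int x = 1;

    int f(void) {
        uintptr_t i = (uintptr_t)&x;   /* pointer-to-integer: x's address is now "exposed" */
        int *p = (int *)i;             /* integer-to-pointer may legitimately yield a pointer to x */
        *p = 2;
        return x;                      /* the compiler must not assume x is still 1 */
    }

    int g(void) {
        int y = 1;                     /* address never taken, never exposed */
        /* no integer anywhere in the program can legitimately become a
           pointer to y, so y is free to live purely in a register */
        return y + 1;
    }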
| |
| ▲ | JonChesterfield 4 days ago | parent | prev [-] | | Pointer provenance was certainly not here in the 80s. That's a more modern creation seeking to extract better performance from some applications at a cost of making others broken/unimplementable. It's not something that exists in the hardware. It's also not a good idea, though trying to steer people away from it proved beyond my politics. | | |
| ▲ | jcranmer 4 days ago | parent | next [-] | | Pointer provenance probably dates back to the 70s, although not under that name. The essential idea of pointer provenance is that it is somehow possible to enumerate all of the uses of a memory location (in a potentially very limited scope). By the time you need to introduce something like "volatile" to indicate to the compiler that there are unknown uses of a variable, you have to concede the point that the compiler needs to be able to track all the known uses within a compiler--and that process, of figuring out known uses, is pointer provenance. As for optimizations, the primary optimization impacted by pointer provenance is... moving variables from stack memory to registers. It's basically a prerequisite for doing any optimization. The thing is that traditionally, the pointer provenance model of compilers is generally a hand-wavey "trace dataflow back to the object address's source", which breaks down in that optimizers haven't maintained source-level data dependency for a few decades now. This hasn't been much of a problem in practice, because breaking data dependencies largely requires you to have pointers that have the same address, and you don't really run into a situation where you have two objects at the same address and you're playing around with pointers to their objects in a way that might cause the compiler to break the dependency, at least outside of contrived examples. | | |
| ▲ | JonChesterfield 4 days ago | parent [-] | | My grievance isn't with aliasing or dataflow, it's with a pointer provenance model which makes assumptions which are inconsistent with reality, optimises based on it, then justifies the nonsense that results with UB. When the hardware behaviour and the pointer provenance model disagree, one should change the model, not change the behavior of the program. | | |
| ▲ | jcranmer 4 days ago | parent [-] | | Give me an example of a program that violates pointer provenance (and only pointer provenance) that you think should be allowed under a reasonable programming model. | | |
| ▲ | JonChesterfield 3 days ago | parent [-] | | This is rather woven in with type-themed alias analysis, which makes a hard distinction tricky. E.g. realloc doesn't work under either, but the provenance issue probably only shows up under no-strict-aliasing. I like pointer tagging because I like dynamic language implementations. That tends to look like "summon a pointer from arithmetic", which will have provenance unknown to the compiler, which is where the "deref without provenance is UB" demon strikes. | |
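A sketch of the low-bit pointer tagging mentioned above, as used in dynamic-language runtimes (names are made up); the unbox step is the integer-to-pointer conversion that provenance rules care about:

    #include <assert.h>
    #include <stdint.h>

    typedef uintptr_t value;   /* tagged word: low bit 1 = small integer, 0 = heap pointer */

    static value box_int(intptr_t i)   { return ((uintptr_t)i << 1) | 1; }
    static intptr_t unbox_int(value v) { return (intptr_t)v >> 1; }  /* arithmetic shift assumed */

    static value box_ptr(void *p) {
        assert(((uintptr_t)p & 1) == 0);    /* relies on alignment for the tag bit */
        return (uintptr_t)p;                /* pointer-to-integer: the address is exposed */
    }

    static void *unbox_ptr(value v) {
        return (void *)(v & ~(uintptr_t)1); /* integer-to-pointer "summoned from arithmetic" */
    }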
| ▲ | jcranmer 3 days ago | parent [-] | | I think you're misunderstanding pointer provenance, and you're being angry at a model that doesn't exist. The failure mode of pointer provenance is converting an integer to a pointer to an object that was never converted to an integer. Tricks like packing integers into unknown bits or packing pointers into floating-point NaNs don't violate pointer provenance--it's really no different from passing a pointer to an external function call and getting it back from a different external function call. | | |
| ▲ | JonChesterfield 3 days ago | parent [-] | | That's definitely possible. The "UB if no provenance information is available" belief comes from https://www.cl.cam.ac.uk/~pes20/cerberus/clarifying-provenan..., in particular: > access via a pointer value with empty provenance is undefined behaviour. I'm annoyed that casting an aligned array of bytes to a pointer to a network packet type is forbidden, and that a pointer to float can't be cast to a pointer to a simd vector of float, and that malloc can't be written in C, but perhaps those aren't provenance either. | |
| ▲ | jcranmer 3 days ago | parent [-] | | > The UB if no provenance information is available belief comes from https://www.cl.cam.ac.uk/~pes20/cerberus/clarifying-provenan..., in particular That's an old document. In particular, it's largely arguing for a PVI provenance model (i.e., integers carry provenance information), whereas the current TS is relying on a PNVI provenance model (i.e., integers do not carry provenance information). https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf is the last draft pre-TS-ification (i.e., has all the background information to understand it). > I'm annoyed that casting an aligned array of bytes to a pointer to a network packet type is forbidden, and that a pointer to float can't be cast to a pointer to a simd vector of float, and that malloc cant be written in C, but perhaps those aren't provenance either. That's all strict aliasing rules, not pointer provenance rules. (Well, malloc has issues with living in the penumbra of the C object model). The big thing that provenance prevents you from doing is writing memcpy in C (since char access of a pointer counts as exposing the pointer, whereas the PNVI model makes memcpy a non-exposing operation). |
| |
| ▲ | tialaramex 4 days ago | parent | prev | next [-] | | > It's not something that exists in the hardware This is sort of on the one hand not a meaningful claim, and then on the other hand not even really true if you squint anyway? Firstly the hardware does not have pointers. It has addresses, and those really are integers. Rust's addr() method on pointers gets you just an address, for whatever that's worth to you, you could write it to a log maybe if you like ? But the Morello hardware demonstrates CHERI, an ARM feature in which a pointer has some associated information that's not the address, a sort of hardware provenance. | |
| ▲ | gpderetta 4 days ago | parent | prev | next [-] | | I'm not a compiler writer, but I don't know how you would be able to implement any optimization while allowing arbitrary pointer forging and without whole-program analysis. | | |
| ▲ | JonChesterfield 4 days ago | parent | next [-] | | It's an interesting question. Say you're working with assembly as your medium, on a von Neumann machine. Writing to parts of the code section is expected behaviour. What can you optimise in such a world? Whatever cannot be observed. Which might mean replacing instructions with sequences of the same length, or it might mean you can't work out anything at all. C is much more restricted. The "function code" isn't there, forging pointers to the middle of a function is not a thing, nor is writing to one to change the function. Thus the dataflow is much easier, be a little careful with addresses of starts of functions and you're good. Likewise the stack pointer is hidden - you can't index into the caller's frame - so the compiler is free to choose where to put things. You can't even index into your own frame so any variable whose address is not taken can go into a register with no further thought. That's the point of higher level languages, broadly. You rule out forms of introspection, which allows more stuff to change. C++ has taken this too far with the object model in my opinion but the committee disagrees. | |
| ▲ | ncruces 4 days ago | parent | prev [-] | | Why? What specific optimization do you have in mind that prevents me from doing an aligned 16/32/64-byte vector load that covers the address pointed to by a valid char*? | | |
| ▲ | gpderetta 4 days ago | parent | next [-] | | Casting a char pointer to a vector pointer and doing vector loads doesn't violate provenance, although it might violate TBAA. Regarding provenance, consider this:

    void bar();

    int foo() {
        int *ptr = malloc(sizeof(int));
        *ptr = 10;
        bar();
        int result = *ptr;
        free(ptr);
        return result;
    }

If the compiler can track the lifetime of the dynamically allocated int, it can remove the allocation and convert this function to simply

    int foo() {
        bar();
        return 10;
    }

It can't if arbitrary code (for example inside bar()) can forge pointers to that memory location. The code can seem silly, but you could end up with something similar after inlining. | | |
| ▲ | rurban 3 days ago | parent | next [-] | | Then show me the compiler which tells the user that it removed this dead code. There is even an assignment removed, which violates all expectations | |
| ▲ | torstenvl 4 days ago | parent | prev [-] | | > It can't if arbitrary code (for example inside bar()) can forge pointers to that memory location. Yes. It absolutely can. What are you even talking about? C is not the Windows Start Menu. This habit of thinking it needs to do what it thinks I might expect instead of what I told it is deeply psychotic. | | |
| ▲ | gpderetta 4 days ago | parent [-] | | I literally have no idea what you are trying to say. Do you mean that bar should be allowed to access *ptr with impunity or not? | |
| ▲ | torstenvl 4 days ago | parent [-] | | I'm not trying to say anything. I said and meant exactly what I said. No more, no less. Your logic is obviously flawed. There is nothing preventing that optimization in the presence of a forged pointer in bar(). | | |
| ▲ | gpderetta 4 days ago | parent [-] | | Either there is no provenance, forging is allowed and the optimization is disallowed; or there is provenance and forging the pointer and attempting to inspect (or modify) the value of *ptr in bar() is UB. | | |
| ▲ | ncruces 4 days ago | parent | next [-] | | You never converted ptr to an integer. If you did, if the pointer escapes, yes, I claim that then the allocation can't be optimized away. Why is that so bad? | |
| ▲ | torstenvl 4 days ago | parent | prev [-] | | Attempting to inspect or modify the value of *ptr in bar() through a forged pointer was always UB. You are saying absolutely nothing meaningful. |
| |
| ▲ | ncruces 4 days ago | parent | prev [-] | | Can't reply to the sibling comment, for some reason. If you don't know the extents of the object pointed to by the char*, using an aligned vector load can reach outside the bounds of the object. Keeping provenance makes that undefined behavior. Using integer arithmetic, and pointer-to-integer/integer-to-pointer conversions would make this implementation defined, and well defined in all of the hardware platforms where an aligned vector load can never possibly fail. So you can't do some optimizations to functions where this happens? Great. Do it. What else? As for why you'd want to do this. C makes strings null-terminated, and you can't know their extents without strlen first. So how do you implement strlen? Similarly your example. Seems great until you're the one implementing malloc. But I'm sure "let's create undefined behavior for a libc implemented in C" is a fine goal. | | |
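A sketch of the strlen technique being alluded to: aligned 16-byte loads that may read past the end of the string, but never past the end of an aligned 16-byte block (so, on common hardware, never across a page boundary). Assumes x86 SSE2 intrinsics and GCC/clang builtins; vec_strlen is a made-up name:

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    size_t vec_strlen(const char *s) {
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15); /* round down to 16 */
        const __m128i zero = _mm_setzero_si128();
        /* first block: mask off bytes that come before s */
        unsigned mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        mask &= ~0u << ((uintptr_t)s & 15);
        while (mask == 0) {
            p += 16;   /* aligned load, possibly reading bytes past the terminator */
            mask = (unsigned)_mm_movemask_epi8(
                _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        }
        return (size_t)(p + __builtin_ctz(mask) - s);
    }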
| ▲ | gpderetta 4 days ago | parent [-] | | [when there is no reply button, you need to click on the date (i.e. N minutes ago) to get the reply box] I think your example would fall foul of reading beyond the end of an object in addition to pointer provenance. In your case the oob read is harmless as you do not expect any meaningful values for the extra bytes, but generally the compiler would not be able to give any guarantees about the content of the additional memory (or that the memory exists in the first place). This specific use case could be addressed by the standard, but vectors are already out of the standard, so in practice you use whatever extension you have to use and abide by whatever additional rules the compiler requires (of course this is often underspecified). For example, on GCC simd primitives already have carve-outs for TBAA. FWIW, a libc implementation in practice already must rely on compiler-specific, beyond-the-standard behaviour anyway. | |
| ▲ | tialaramex 4 days ago | parent [-] | | > [when there is no reply button, you need to click on the date (i.e. N minutes ago) to get the reply box] As an off-topic aside here that might help anybody who is wondering: HN deliberately doesn't provide "Reply" for very recent comments to try to dissuade you from having the sort of urgent back-and-forth you might reasonably do in a real time chat system, and less reasonably attempt (and likely regret) on platforms like Twitter. A brief window to think about the thing you just read might cause you to write something more thoughtful, and even to realise that it wasn't saying what you had thought in the first place. My favourite example was one where somebody said a feature means "less typing" and another comment insisted it did not, and I was outraged until I realised all that's happening is that one person thinks "Typing" means "You know, pressing keys on your keyboard" and the other person thinks "Typing" means "You know, why an integer is different from a float in C" - so they're actually not even disagreeing, the conflict is purely over the word! | |
| ▲ | gpderetta 4 days ago | parent [-] | | Allegedly. Instead I like to think it is a reality check to remind me I'm wasting too much time on HN and should do something productive :D |
| |
| ▲ | lmkg 4 days ago | parent | prev [-] | | It very much is something that exists in hardware. One of the major reasons why people finally discovered the provenance UB lurking in the standard is because of the CHERI architecture. | | |
| ▲ | AnimalMuppet 4 days ago | parent | next [-] | | So it's something that exists in some hardware. Are you claiming that it exists in all hardware, and we only realized that because of CHERI? Or are you claiming that it exists in CHERI hardware, but not in others. If it only exists in some hardware, how should the standard deal with that? | | |
| ▲ | lmkg 4 days ago | parent [-] | | > If it only exists in some hardware, how should the standard deal with that? Generally it seems to me the C standard makes things like that UB. Signed integer overflow, for example. Implemented as wrapping two's-complement on modern architectures, defined as such in many modern languages, but UB in C due to ongoing support for niche architectures. The issues around pointer provenance are inherent to the C abstract machine. It's a much more immediate show-stopper on architectures that don't have a flat address space, and the C abstract machine doesn't assume a flat address space because it supports architectures where that's not true. My understanding is that this reflects some oddball historical architectures that aren't relevant anymore; nowadays that includes CHERI. | |
| ▲ | uecker 4 days ago | parent [-] | | Historically, the reason was often niche architectures. But sometimes certain behavior dies out and we can make semantics more strict. For example, two's complement is now a requirement for C. Still, we did not make signed overflow defined. The reasons are optimization and - maybe surprising for some - safety. UB can be used to insert the compile-time checks we need to make things safe, but often we can not currently require everyone to do this. At the same time, making things defined may make things worse. For example, finding wraparound bugs in unsigned arithmetic - though well-defined - is a difficult and serious problem. For signed overflow, you use a compiler flag and this is not exploitable anymore (could still be a DoS). |
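A sketch of the point about the compiler flag: precisely because signed overflow is UB, a checker can be inserted at build time, e.g. with -fsanitize=signed-integer-overflow (GCC/clang UBSan) or -ftrapv, turning a silent wraparound into a diagnostic or trap; next_id is a made-up name:

    #include <limits.h>

    int next_id(int id) {
        /* UB on overflow in plain C; with the sanitizer enabled this
           traps or reports when id == INT_MAX instead of wrapping */
        return id + 1;
    }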
|
| |
| ▲ | pjmlp 4 days ago | parent | prev [-] | | People keep forgetting that SPARC ADI did it first with hardware memory tagging for C. |
|