| ▲ | JonChesterfield 4 days ago |
| Pointer provenance was certainly not here in the 80s. That's a more modern creation seeking to extract better performance from some applications at a cost of making others broken/unimplementable. It's not something that exists in the hardware. It's also not a good idea, though trying to steer people away from it proved beyond my politics. |
|
| ▲ | jcranmer 4 days ago | parent | next [-] |
Pointer provenance probably dates back to the 70s, although not under that name. The essential idea of pointer provenance is that it is somehow possible to enumerate all of the uses of a memory location (in a potentially very limited scope). By the time you need to introduce something like "volatile" to indicate to the compiler that there are unknown uses of a variable, you have to concede the point that the compiler needs to be able to track all the known uses--and that process, of figuring out known uses, is pointer provenance.

As for optimizations, the primary optimization impacted by pointer provenance is... moving variables from stack memory to registers. It's basically a prerequisite for doing any optimization.

The thing is that, traditionally, the pointer provenance model of compilers has been a hand-wavy "trace dataflow back to the object address's source", which breaks down because optimizers haven't maintained source-level data dependencies for a few decades now. This hasn't been much of a problem in practice, because breaking data dependencies largely requires you to have pointers that have the same address, and you don't really run into a situation where you have two objects at the same address and you're playing around with pointers to those objects in a way that might cause the compiler to break the dependency, at least outside of contrived examples. |
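As a minimal sketch of that use-tracking (external() here stands in for any function the compiler cannot see into):

    void external(int *p);   /* unknown code */

    int no_escape(void) {
        int x = 0;                       /* address never taken: every use of x is
                                            known, so x can live in a register     */
        for (int i = 0; i < 100; i++)
            x += i;
        return x;
    }

    int escapes(void) {
        int x = 0;
        for (int i = 0; i < 100; i++) {
            external(&x);                /* &x escapes: unknown code may read or
                                            write x, so x has to be stored before
                                            the call and reloaded after it         */
            x += i;
        }
        return x;
    }

Working out that no such unknown use exists in no_escape() is exactly the kind of "known uses" reasoning described above.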
| |
| ▲ | JonChesterfield 4 days ago | parent [-] | | My grievance isn't with aliasing or dataflow, it's with a pointer provenance model which makes assumptions which are inconsistent with reality, optimises based on it, then justifies the nonsense that results with UB. When the hardware behaviour and the pointer provenance model disagree, one should change the model, not change the behavior of the program. | | |
| ▲ | jcranmer 4 days ago | parent [-] | | Give me an example of a program that violates pointer provenance (and only pointer provenance) that you think should be allowed under a reasonable programming model. | | |
| ▲ | JonChesterfield 3 days ago | parent [-] | | This is rather woven in with type-based alias analysis, which makes a hard distinction tricky. E.g. realloc doesn't work under either, but the provenance issue probably only shows up under no-strict-aliasing. I like pointer tagging because I like dynamic language implementations. That tends to look like "summon a pointer from arithmetic", which will have provenance unknown to the compiler, which is where the "deref without provenance is UB" demon strikes. | | |
| ▲ | jcranmer 3 days ago | parent [-] | | I think you're misunderstanding pointer provenance, and you're angry at a model that doesn't exist. The failure mode of pointer provenance is converting an integer to a pointer to an object that was never converted to an integer. Tricks like packing integers into unused pointer bits or packing pointers into floating-point NaNs don't violate pointer provenance--it's really no different from passing a pointer to an external function call and getting it back from a different external function call. | | |
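For instance, a minimal sketch of low-bit pointer tagging (assuming the object is at least 8-byte aligned, so the low three bits are free for a tag):

    #include <stdint.h>

    typedef struct { double d; } Obj;            /* 8-byte aligned on typical ABIs */

    uintptr_t tag_ptr(Obj *p, unsigned tag) {    /* tag < 8 */
        return (uintptr_t)p | tag;               /* pointer-to-integer cast: the
                                                    address is now "exposed"       */
    }

    Obj *untag_ptr(uintptr_t v) {
        return (Obj *)(v & ~(uintptr_t)7);       /* round-trips to the same object,
                                                    which the provenance model
                                                    permits                        */
    }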
| ▲ | JonChesterfield 3 days ago | parent [-] | | That's definitely possible. The "UB if no provenance information is available" belief comes from https://www.cl.cam.ac.uk/~pes20/cerberus/clarifying-provenan..., in particular:

> access via a pointer value with empty provenance is undefined behaviour

I'm annoyed that casting an aligned array of bytes to a pointer to a network packet type is forbidden, and that a pointer to float can't be cast to a pointer to a SIMD vector of float, and that malloc can't be written in C, but perhaps those aren't provenance either. | | |
| ▲ | jcranmer 3 days ago | parent [-] | | > The "UB if no provenance information is available" belief comes from https://www.cl.cam.ac.uk/~pes20/cerberus/clarifying-provenan..., in particular

That's an old document. In particular, it's largely arguing for a PVI provenance model (i.e., integers carry provenance information), whereas the current TS relies on a PNVI provenance model (i.e., integers do not carry provenance information). https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf is the last draft pre-TS-ification (i.e., it has all the background information needed to understand it).

> I'm annoyed that casting an aligned array of bytes to a pointer to a network packet type is forbidden, and that a pointer to float can't be cast to a pointer to a SIMD vector of float, and that malloc can't be written in C, but perhaps those aren't provenance either.

Those are all strict aliasing rules, not pointer provenance rules. (Well, malloc has issues with living in the penumbra of the C object model.) The big thing that a PVI-style provenance model prevents you from doing is writing memcpy in C (since char-level access to a pointer's bytes would count as exposing the pointer), whereas the PNVI model makes memcpy a non-exposing operation. |
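Roughly, under PNVI-ae an integer-to-pointer cast is fine as long as the object's address was exposed at some point; a sketch (not the TS's wording):

    #include <stdint.h>

    int a = 1, b = 2;

    void demo(void) {
        uintptr_t ia = (uintptr_t)&a;   /* &a cast to an integer: a's address is
                                           now exposed                             */
        int *p = (int *)ia;             /* re-synthesizing the pointer is allowed  */
        *p = 3;                         /* well defined                            */
        /* Guessing b's address from ia (say, (int *)(ia + sizeof(int))) would
           yield a pointer with empty provenance, because &b was never exposed;
           dereferencing it is undefined behaviour. */
    }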
|
|
|
|
|
|
|
| ▲ | tialaramex 4 days ago | parent | prev | next [-] |
> It's not something that exists in the hardware

This is, on the one hand, not really a meaningful claim, and on the other hand not even really true if you squint. Firstly, the hardware does not have pointers. It has addresses, and those really are integers. Rust's addr() method on pointers gets you just an address, for whatever that's worth to you; you could write it to a log if you like. But the Morello hardware demonstrates CHERI, an ARM feature in which a pointer carries some associated information that's not the address, a sort of hardware provenance. |
|
| ▲ | gpderetta 4 days ago | parent | prev | next [-] |
| I'm not a compiler writer, but I don't know how you would be able to implement any optimization while allowing arbitrary pointer forging and without whole-program analysis. |
| |
| ▲ | JonChesterfield 4 days ago | parent | next [-] | | It's an interesting question. Say you're working with assembly as your medium, on a von Neumann machine. Writing to parts of the code section is expected behaviour. What can you optimise in such a world? Whatever cannot be observed, which might mean replacing instructions with sequences of the same length, or it might mean you can't work out anything at all.

C is much more restricted. The "function code" isn't there: forging pointers to the middle of a function is not a thing, nor is writing through one to change the function. Thus the dataflow is much easier; be a little careful with the addresses of the starts of functions and you're good. Likewise the stack pointer is hidden - you can't index into the caller's frame - so the compiler is free to choose where to put things. You can't even index into your own frame, so any variable whose address is not taken can go into a register with no further thought.

That's the point of higher-level languages, broadly. You rule out forms of introspection, which allows more stuff to change. C++ has taken this too far with the object model in my opinion, but the committee disagrees. | |
| ▲ | ncruces 4 days ago | parent | prev [-] | | Why? What specific optimization do you have in mind that prevents me from doing an aligned 16/32/64-byte vector load that covers the address pointed to by a valid char*? | | |
| ▲ | gpderetta 4 days ago | parent | next [-] | | Casting a char pointer to a vector pointer and doing vector loads doesn't violate provenance, although it might violate TBAA. Regarding provenance, consider this:

    void bar();

    int foo() {
        int *ptr = malloc(sizeof(int));
        *ptr = 10;
        bar();
        int result = *ptr;
        free(ptr);
        return result;
    }
If the compiler can track the lifetime of the dynamically allocated int, it can remove the allocation and convert this function to simply:

    int foo() {
        bar();
        return 10;
    }
It can't if arbitrary code (for example inside bar()) can forge pointers to that memory location. The code can seem silly, but you could end up with something similar after inlining. | | |
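For contrast, a sketch of the exposed case (leaked is a hypothetical global that bar() might read):

    #include <stdlib.h>
    #include <stdint.h>

    void bar(void);
    uintptr_t leaked;                      /* hypothetical side channel */

    int foo_exposed(void) {
        int *ptr = malloc(sizeof(int));
        *ptr = 10;
        leaked = (uintptr_t)ptr;           /* the address is now exposed, so bar()
                                              may legitimately reconstruct the
                                              pointer and write through it         */
        bar();
        int result = *ptr;                 /* the compiler can no longer assume
                                              this still reads 10, nor remove the
                                              allocation                           */
        free(ptr);
        return result;
    }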
| ▲ | rurban 3 days ago | parent | next [-] | | Then show me the compiler that tells the user it removed this dead code. There is even an assignment removed, which violates all expectations. | |
| ▲ | torstenvl 4 days ago | parent | prev [-] | | > It can't if arbitrary code (for example inside bar()) can forge pointers to that memory location. Yes. It absolutely can. What are you even talking about? C is not the Windows Start Menu. This habit of thinking it needs to do what it thinks I might expect instead of what I told it is deeply psychotic. | | |
| ▲ | gpderetta 4 days ago | parent [-] | | I literally have no idea what you are trying to say. Do you mean that bar should be allowed to access *ptr with impunity or not? | |
| ▲ | torstenvl 4 days ago | parent [-] | | I'm not trying to say anything. I said and meant exactly what I said. No more, no less. Your logic is obviously flawed. There is nothing preventing that optimization in the presence of a forged pointer in bar(). | | |
| ▲ | gpderetta 4 days ago | parent [-] | | Either there is no provenance, forging is allowed and the optimization is disallowed; or there is provenance and forging the pointer and attempting to inspect (or modify) the value of *ptr in bar() is UB. | | |
| ▲ | ncruces 4 days ago | parent | next [-] | | You never converted ptr to an integer. If you did, and the pointer escapes, then yes, I claim the allocation can't be optimized away. Why is that so bad? |
| ▲ | torstenvl 4 days ago | parent | prev [-] | | Attempting to inspect or modify the value of *ptr in bar() through a forged pointer was always UB. You are saying absolutely nothing meaningful. |
|
|
|
|
| |
| ▲ | ncruces 4 days ago | parent | prev [-] | | Can't reply to the sibling comment, for some reason.

If you don't know the extents of the object pointed to by the char*, using an aligned vector load can reach outside the bounds of the object. Keeping provenance makes that undefined behavior. Using integer arithmetic and pointer-to-integer/integer-to-pointer conversions would make this implementation-defined, and well defined on all of the hardware platforms where an aligned vector load can never possibly fail. So you can't do some optimizations to functions where this happens? Great. Do it. What else?

As for why you'd want to do this: C makes strings null-terminated, and you can't know their extents without strlen first. So how do you implement strlen? Similarly, your example seems great until you're the one implementing malloc. But I'm sure "let's create undefined behavior for a libc implemented in C" is a fine goal. | |
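For the record, a sketch of the word-at-a-time strlen trick being alluded to (assuming 64-bit words; the aligned reads can run past the end of the string, which is harmless on real hardware because an aligned word never crosses a page boundary, but it is exactly the out-of-bounds/provenance question under discussion):

    #include <stddef.h>
    #include <stdint.h>

    size_t my_strlen(const char *s) {
        const char *p = s;
        while ((uintptr_t)p % sizeof(uint64_t) != 0) {   /* reach an aligned word */
            if (*p == '\0') return (size_t)(p - s);
            p++;
        }
        const uint64_t *w = (const uint64_t *)p;
        while (!((*w - 0x0101010101010101ULL) & ~*w & 0x8080808080808080ULL))
            w++;                                          /* no zero byte in word  */
        p = (const char *)w;
        while (*p) p++;                                   /* find the exact NUL    */
        return (size_t)(p - s);
    }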
| ▲ | gpderetta 4 days ago | parent [-] | | [when there is no reply button, you need to click on the date (i.e. N minutes ago) to get the reply box]

I think your example would fall foul of reading beyond the end of an object, in addition to pointer provenance. In your case the OOB read is harmless, as you do not expect any meaningful values in the extra bytes, but generally the compiler would not be able to give any guarantees about the content of the additional memory (or that the memory exists in the first place). This specific use case could be addressed by the standard, but vectors are already outside the standard, so in practice you use whatever extension you have to use and abide by whatever additional rules the compiler requires (of course this is often underspecified). For example, on GCC the SIMD primitives already have carve-outs for TBAA. FWIW, a libc implementation in practice must already rely on compiler-specific, beyond-the-standard behaviour anyway. | |
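The carve-out being referred to looks roughly like this; GCC/Clang define __m128i in much the same way, with __may_alias__ so that loads through it don't trip strict aliasing (alignment is still the programmer's problem):

    typedef long long v2di __attribute__((__vector_size__(16), __may_alias__));

    long long first_lane(const char *p) {
        v2di v = *(const v2di *)p;   /* p must be 16-byte aligned for this load */
        return v[0];
    }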
| ▲ | tialaramex 4 days ago | parent [-] | | > [when there is no reply button, you need to click on the date (i.e. N minutes ago) to get the reply box]

As an off-topic aside here that might help anybody who is wondering: HN deliberately doesn't provide "Reply" for very recent comments to try to dissuade you from having the sort of urgent back-and-forth you might reasonably do in a real-time chat system, and less reasonably attempt (and likely regret) on platforms like Twitter. A brief window to think about the thing you just read might cause you to write something more thoughtful, and even to realise that it wasn't saying what you had thought in the first place. My favourite example was one where somebody said a feature means "less typing" and another comment insisted it did not, and I was outraged until I realised all that's happening is that one person thinks "typing" means "you know, pressing keys on your keyboard" and the other person thinks "typing" means "you know, why an integer is different from a float in C", and so they're actually not even disagreeing; the conflict is purely syntax! | |
| ▲ | gpderetta 4 days ago | parent [-] | | Allegedly. Instead I like to think it is a reality check to remind me that I'm wasting too much time on HN and should do something productive :D |
|
|
|
|
|
|
| ▲ | lmkg 4 days ago | parent | prev [-] |
It very much is something that exists in hardware. One of the major reasons people finally discovered the provenance UB lurking in the standard is the CHERI architecture. |
| |
| ▲ | AnimalMuppet 4 days ago | parent | next [-] | | So it's something that exists in some hardware. Are you claiming that it exists in all hardware, and we only realized that because of CHERI? Or are you claiming that it exists in CHERI hardware, but not in others? If it only exists in some hardware, how should the standard deal with that? | |
| ▲ | lmkg 4 days ago | parent [-] | | > If it only exists in some hardware, how should the standard deal with that?

Generally, it seems to me the C standard makes things like that UB. Signed integer overflow, for example: implemented as wrapping two's-complement on modern architectures, defined as such in many modern languages, but UB in C due to ongoing support for niche architectures. The issues around pointer provenance are inherent to the C abstract machine. They're a much more immediate show-stopper on architectures that don't have a flat address space, and the C abstract machine doesn't assume a flat address space because it supports architectures where that's not true. My understanding is that this historically reflected some oddball architectures that aren't relevant anymore; nowadays it includes CHERI. | |
| ▲ | uecker 4 days ago | parent [-] | | Historically, the reason was often niche architectures. But sometimes certain behavior dies out and we can make semantics more strict. For example, two's complement is now a requirement for C. Still, we did not make signed overflow defined. The reasons are optimization and - maybe surprising for some - safety. UB can be used to insert the compile-time checks we need to make things safe, but often we cannot currently require everyone to do this. At the same time, making things defined may make things worse. For example, finding wraparound bugs in unsigned arithmetic - though well-defined - is a difficult and serious problem. For signed overflow, you use a compiler flag and this is not exploitable anymore (could still be a DoS). |
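As a sketch of that last point: building with -fsanitize=signed-integer-overflow turns the UB into a runtime report/trap, and code that wants checked arithmetic can ask for it explicitly via the GCC/Clang builtins:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        int r;
        if (__builtin_add_overflow(INT_MAX, 1, &r))   /* nonzero means overflow */
            puts("overflow detected");
        return 0;
    }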
|
| |
| ▲ | pjmlp 4 days ago | parent | prev [-] | | People keep forgetting that SPARC ADI did it first with hardware memory tagging for C. |
|