beeforpork 3 hours ago

The UB in unaligned pointers is even worse: an unaligned pointer is UB in itself, not only an access through it. So even implicitly converting a void *v to an int *i (like 'i=v' in C, or 'f(v)' when f() accepts an int*) is UB if the resulting pointer is not correctly aligned for int.

It is important to understand that this is a C-level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it violates the C language spec. UB is not on the HW; it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types exist in C only, not on the HW, so a cast is a reinterpretation at the C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No: it does not matter that pointers are just integers in registers on your HW. Your C program is broken by definition if the converted pointer is unaligned.
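A minimal sketch of what this means in practice, assuming a mainstream platform where converting a pointer to uintptr_t yields its address (that conversion is implementation-defined). The helper names is_aligned_for_int and load_int are made up for illustration. The idea is to never form the int* at all when the address may be unaligned, and to copy the bytes instead:

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if p could be converted to int* without
   violating int's alignment requirement. Assumes the uintptr_t value
   is the numeric address, which holds on common platforms. */
bool is_aligned_for_int(const void *p)
{
    return (uintptr_t)p % alignof(int) == 0;
}

/* Hypothetical helper: read an int from any address. memcpy works on
   bytes, so no int* to the (possibly unaligned) source is ever formed. */
int load_int(const void *p)
{
    int value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

Compilers typically turn the memcpy into a single load on targets that tolerate unaligned access, so this usually costs nothing there.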

stilley2 3 minutes ago

Does that mean that if I have a struct with #pragma pack(push, 1) I can't use pointers to any members that don't happen to be aligned?
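A sketch of the situation being asked about, assuming a compiler that supports #pragma pack (GCC, Clang, and MSVC do). The struct layout and function names are hypothetical. Reading the member through the struct is fine, because the compiler knows the type is packed. The risky step is taking the member's address as a plain uint32_t*, which recent GCC and Clang flag with -Waddress-of-packed-member:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
struct wire_header {      /* hypothetical layout, for illustration only */
    uint8_t  tag;
    uint32_t length;      /* lands at offset 1, so it is misaligned */
};
#pragma pack(pop)

/* Access by value through the packed struct: the compiler emits
   whatever load sequence the target needs. */
uint32_t header_length(const struct wire_header *h)
{
    return h->length;
}

/* Same result without relying on the member access: copy the bytes
   from the member's offset into an aligned local. */
uint32_t header_length_memcpy(const struct wire_header *h)
{
    uint32_t out;
    memcpy(&out, (const unsigned char *)h + offsetof(struct wire_header, length),
           sizeof out);
    return out;
}
```

If other code needs a uint32_t*, pointing it at an aligned copy of the value sidesteps the question instead of pointing it at &h->length.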

thomashabets2 2 hours ago

Author here.

> an unaligned pointer in itself is UB

Yup. Per the "Actually, it was UB even before that" section in the post.

> UB is not on the HW, it has nothing to do with crashes or faults

Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.

account42 2 hours ago

Which is totally fine and expected for any decent programmer. Casting pointers is clearly here be dragons territory.

simonask 2 hours ago

Many, many programmers come to C (and C++) with a lower-level understanding that actually gets in the way here. They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.

Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.

lelanthran 2 hours ago

> They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

I'd clarify this with "They understand that all values are just bytes".

> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.

It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".

A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (signed two's complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards committee) who don't want to make UB into IB.

benj111 an hour ago

> It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".

I'd agree to a point. I still think it's unreasonable for compiler writers to get all lawyery about precise terminology. After all "implementation defined" could still be subject to the same lawyeriness (we implemented it, ergo we define it).

To me this is an issue of culture. We need to push back against the view that UB means anything can happen, therefore the compiler can do anything.

fc417fc802 32 minutes ago

But it's genuinely useful. In all seriousness, are you sure you aren't perhaps just using the wrong language? At this point UB and leveraging it for optimization are core parts of the most performant C implementations.

That said, I think there are many cases where compilers could make a better effort to link UB they're optimizing against to UB that appears in the code as originally authored and emit a diagnostic or even error out. But at least we've got ubsan and friends so it seems like things are within reason if not optimal.

benj111 17 minutes ago

>are you sure you aren't perhaps just using the wrong language

Well I think there is a tension here. C is the language for microcontrollers and the language for high performance.

In ye olden days both groups' interests were aligned, because speed in C was about working with the machine. Now that UB has been hijacked for speed, on the microcontroller I'm working on, where I know an int will overflow and I rely on that, the overflow is UB and so may be optimised out, and I then have to think about what the compiler may do.
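For the overflow case specifically, here is a sketch of one way to get the wraparound without signed-overflow UB, assuming the fixed-width types from <stdint.h> are available. The name wrapping_add_i32 is made up. Unsigned arithmetic is defined to wrap, and int32_t is required to be two's complement with no padding, so the bits can be copied back:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add two int32_t values with wraparound. */
int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    /* Unsigned arithmetic wraps modulo 2^32 by definition. */
    uint32_t sum = (uint32_t)a + (uint32_t)b;

    /* Reinterpret the bits as signed; memcpy avoids the
       implementation-defined unsigned-to-signed conversion. */
    int32_t out;
    memcpy(&out, &sum, sizeof out);
    return out;
}
```

GCC and Clang also accept -fwrapv, which makes signed overflow wrap for the whole translation unit, at the cost of tying the code to those compilers.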

I wouldn't say C is the wrong language. I would say there are wrong compilers though.

pjc50 2 hours ago

Except ARM32. ARM64 doesn't guarantee it to be valid in all cases either.

tovej 2 hours ago

But that seems obvious. You can't load an integer from an unaligned address.

It's not only C-level, is it? There's no machine code for that either (or at least no guarantee of it across architectures).

codeflo 2 hours ago

> You can't load an integer from an unaligned address.

You can, and the results are machine specific, clearly defined and well-documented. Ancient ARM raises an exception, modern ARM and x86 can do it with a performance penalty. It's only the C or C++ layer that is allowed to translate the code into arbitrary garbage, not the CPU.

matheusmoreira 2 hours ago

Sure you can. In many architectures it works just fine. Works perfectly in x86_64, for example. It's just a little slower.

tovej an hour ago

"In many architectures" does not mean you can. The standard is supposed to cover all architectures.

matheusmoreira 30 minutes ago

If some architecture traps on unaligned access, then the compiler can and should simply generate the correct code so that it loads the integer piece by piece instead. Load multiple integers and shift and mask away the irrelevant bits, done. This is exactly what modern architectures already do in hardware. Works, it's just a little slower.
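Written by hand, the piece-by-piece load looks roughly like this. It is a simplified byte-wise variant of the shift-and-mask idea, assuming the data is stored little-endian. The name load_le32 is made up:

```c
#include <stdint.h>

/* Hypothetical helper: assemble a little-endian 32-bit value one byte
   at a time. Only unsigned char accesses are made, so the source
   address needs no particular alignment. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

This is portable C today, and it has the side benefit of fixing the byte order explicitly instead of inheriting the host's.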

This is exactly what the compilers do if you use a packed structure to access unaligned data. Works everywhere, as expected. Compilers have always known what to do, they just weren't doing it. C standard says no.

The fact is the standard is garbage and the first thing every C programmer should learn is that they can and should ignore it. There is never any reason to wonder what the standard is supposed to do. The only thing that matters is what compilers actually do.

bluGill 3 minutes ago

The pointer might be something you forced. The compiler needs to do the right thing, but if you set the pointer to an unaligned address because you have information about the hardware, you can get this undefined situation with nothing the compiler can do about it.

mbel 2 hours ago

Unless your code targets some exotic architecture, like idk x86.

cataphract 35 minutes ago

Not really. Wait until the compiler starts vectorizing your code and using instructions requiring alignment (like the ones with A or NT in the mnemonic).

pjc50 2 hours ago

You missed the point: the pointer existing as a value of that type at all is UB, even if you never try to access anything through it and no corresponding machine code is ever emitted.

tovej an hour ago

Yes? I agree with that. I don't really see the issue there. The computer will allocate data at aligned addresses, so you would have to be doing something weird to begin with to access unaligned pointers. And aligned access is always better anyway. I guess packed structs are a thing if you're really byte golfing. Maybe compressed network data would also make sense.

But then I would assume you are aware of unaligned pointers, and have a sane way to parse that data, rather than read individual parts of it from a raw pointer.

I am curious, what would be a legitimate reason for an unaligned pointer to int?