quelsolaar 3 hours ago

The 5 stages of learning about UB in C:

-Denial: "I know what signed overflow does on my machine."

-Anger: "This compiler is trash! Why doesn't it just do what I say!?"

-Bargaining: "I'm submitting this proposal to wg14 to fix C..."

-Depression: "Can you rely on C code for anything?"

-Acceptance: "Just don't write UB."

matheusmoreira 2 hours ago | parent | next [-]

What stage is the "just make the compiler define the undefined" stage?

Unaligned access? Packed structs. Compiler will magically generate the correct code, as if it had always known how to do it right all along! Because it has, in fact, always known how to do it right. It just didn't.

Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so. Alternatively, just disable it straight up: -fno-strict-aliasing. Enjoy reinterpreting memory as you see fit. You might hit some sharp edges here and there but they sure as hell aren't gonna be coming from the compiler.
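A minimal sketch of the union approach, assuming a 32-bit IEEE 754 `float`. The name `float_bits` is made up for illustration; the behavior relied on is the one GCC documents for reads through a union type, even with -fstrict-aliasing enabled.

```c
#include <stdint.h>

/* Assumption: float and uint32_t have the same size on this target. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: return the raw bit pattern of a float.
   The value is written through one union member and read through another,
   so every access goes through the union type itself. */
uint32_t float_bits(float f)
{
    union {
        float f;
        uint32_t u;
    } pun;

    pun.f = f;
    return pun.u;
}
```

Calling `float_bits(1.0f)` on an IEEE 754 target gives `0x3F800000`. Casting `&f` to `uint32_t *` and dereferencing it would be the strict-aliasing violation this avoids.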

Overflow? Just make it defined: -fwrapv. Replace +, -, * with __builtin_*_overflow while you're at it, and you even get explicit error checking for free. Nice functional interface. Generates efficient code too.
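A short sketch of what the builtin looks like in practice, assuming GCC or Clang. `checked_add` is a hypothetical wrapper name; `__builtin_add_overflow` is the actual builtin, and it returns true when the exact result does not fit in the destination.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper: add two int32_t values with explicit error checking.
   Returns true and stores the sum in *out on success.
   Returns false if the mathematically exact sum does not fit in int32_t. */
bool checked_add(int32_t a, int32_t b, int32_t *out)
{
    /* The builtin performs the addition as if with infinite precision,
       stores the (possibly wrapped) result, and reports whether it overflowed. */
    return !__builtin_add_overflow(a, b, out);
}
```

The caller gets a plain boolean to branch on, so no signed overflow ever happens in the source and there is nothing for the optimizer to assume away.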

The "acceptance" stage is really "nobody sane actually cares about the C standard". The standard is garbage, only the compilers matter. And it turns out that compilers have plenty of extremely useful functions that let you side step most if not all of this. People just don't use this because they want to write "portable" "standard" C. The real acceptance is to break out of that mindset.

Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

gpderetta 2 hours ago | parent | next [-]

> Unaligned access? Packed structs.

Packed structs are dangerous. You can do unaligned accesses through a packed type, but once you take the address of your misaligned int field, then you are back into UB territory. Very annoying in C++ when you try to pass a misaligned field through what happens to be generic code that takes a const reference, as it will trigger a compiler warning. Unary operator+ is your friend.

matheusmoreira an hour ago | parent [-]

> but once you take the address of your misaligned int field

Gotta work with the structure directly by taking the address of the packed structure itself.

  struct uu64 {
      u64 value;
  } __attribute__((packed));

  struct uu64 unaligned;
  struct uu64 *address = &unaligned;

  address->value; // this works

  u64 *broken = &address->value; // this doesn't

Taking the address of the field inside the structure essentially casts away the alignment information that was explicitly added to stop the compiler from screwing things up. So it should not be done.

Mercifully, both gcc and clang emit address-of-packed-member warnings if it's done. So the packed structures are effectively turning silently broken nonsense code into sensible warnings. Major win.

lelanthran 2 hours ago | parent | prev [-]

> What stage is the "just make the compiler define the undefined" stage?

It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.

Take, for example, signed-integer overflow: currently a compiler can simply refuse to emit the code in one spot while emitting it in another spot in the same compilation unit! Making it IB means that the compiler vendor will be forced to define what happens when a signed-integer overflows, rather than just saying, as they do now, "you cannot do that, and if you do we can ignore it, correct it, replace it or simply travel back in time and corrupt your program".
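To make that concrete, here is a small sketch with two made-up function names. The first check only works if the overflow actually happens and wraps, so an optimizer is allowed to assume it never does and reduce the function to `return false`. The second asks the question before doing any arithmetic.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical example of a check that depends on UB.
   For a == INT_MAX the expression a + 1 overflows, which is undefined,
   so the compiler may treat the comparison as always false. */
bool next_overflows_ub(int a)
{
    return a + 1 < a;
}

/* Hypothetical UB-free version: no addition is performed,
   so there is nothing for the compiler to reason away. */
bool next_overflows_safe(int a)
{
    return a == INT_MAX;
}
```

With -fwrapv the first version is well defined on GCC and Clang; without it, only the second one is guaranteed to keep its check.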

> Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.

matheusmoreira 43 minutes ago | parent [-]

Completely agree. It can, and I think it's extremely annoying that it wasn't.

So we have the next best thing: builtins and flags. So long as those cover all the undefined behavior there is, we can live with it. Compiler gets to be "conformant" and we get to do useful things without the compiler folding the code into itself and inside out.

thomashabets2 2 hours ago | parent | prev | next [-]

Author here.

> -Acceptance: "Just don't write UB."

The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.

jart 34 minutes ago | parent [-]

It's honestly not that difficult to be rigorous. The things you mentioned in the blog post are pretty obvious forms of degenerate practices once you get used to seeing them. The best way to make your argument would be to bring up pointer overflow being UB.

What's great about undefined behavior is that the C language doesn't require you to care. You can play fast and loose as much as you want. You can even use implicit types and yolo your app, writing C that more closely resembles JavaScript, just like traditional K&R C devs did back in the day under an ILP32 model. Then you add the rigor later if you care about it.

For most stuff, like an experiment, we obviously don't care, but when I do, I can usually one-shot a file without any UB (which I check by reading the assembly output after building it with UBSan). There's just one thing that I usually can't eliminate, which is the compiler generating code that checks for pointer overflow, because that's such a ridiculous concept on modern machines which have a 56 bit address space. Maybe it mattered when coding for platforms like i8086. I've seen almost no code that cares about this. I have to sometimes, in my C library. It's important that functions like memchr() for example don't say `for (char *p = data, *e = data + size; p<e; ...` and instead say `for (size_t i = 0; i < size; ++i) ...data[i]...`. But these are just the skills you get with mastery, which is what makes it fun.

Oh, speaking of which, another fun thing everyone misses is the pitfalls of vectorization. You have to venture off into UB land in order to get better performance. But readahead can get you into trouble if you're trying to scan something like a string that's at the end of a memory page, where the subsequent page isn't mapped.

My other favorite thing is designing code in such a way that the stack frame of any given function never exceeds 4096 bytes, and using alloca in a bounded way that pokes pages if it must be exceeded.

If you want to have a fun time experiencing why the UB rules are as tricky as they are, try writing your own malloc() function that uses shorts and having it be on the stack, so you can have dynamic memory in a signal handler.
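A minimal sketch of the indexed loop shape mentioned above for memchr(). `scan_byte` is a made-up name and this is not taken from any particular C library; it only shows how indexing avoids ever forming the `data + size` end pointer.

```c
#include <stddef.h>

/* Hypothetical memchr-style scan.
   Returns a pointer to the first byte equal to (unsigned char)c
   within the first size bytes of data, or NULL if there is none. */
void *scan_byte(const void *data, int c, size_t size)
{
    const unsigned char *p = data;

    /* The loop counts with a size_t index, so the only pointers ever
       computed are p + i for i < size, all inside the buffer. */
    for (size_t i = 0; i < size; ++i) {
        if (p[i] == (unsigned char)c)
            return (void *)(p + i);
    }
    return NULL;
}
```

For a buffer holding "hello", `scan_byte(buf, 'l', 5)` returns `buf + 2`, and a search for a byte that is absent returns NULL.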

1718627440 an hour ago | parent | prev | next [-]

> -Denial: "I know what signed overflow does on my machine."

Or you just don't skip the introductory pages that tell you what the language philosophy of C is, and why there is UB. Yes, UB can be a struggle, but the first four steps are entirely unnecessary. It means that you do not actually understand the core concepts of the very same language you are using, which is kinda stupid.

whizzter 31 minutes ago | parent [-]

I think the issue has been that the line between de-jure and de-facto behaviours has shifted over the years as compiler optimizations suddenly began relying on de-jure interpretations of UB to increase performance while ignoring de-facto usage of the language.

When that started happening people became alarmed (oMG UB iS TeH BAD!) and since some old UB machines still had industry support (of organisations that actually participated in ISO meetings instead of arguing online) there was never any movement on defining de-facto usage as de-jure and the alarmist position became the default.

Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.

1718627440 25 minutes ago | parent [-]

> I think the issue has been that the line between de-jure and de-facto behaviours has shifted over the years as compiler optimizations suddenly began relying on de-jure interpretations of UB to increase performance while ignoring de-facto usage of the language.

I guess I am too young, and also too much of a purist, because I start from the impression of what the language is, not what the implementations happen to do.

> Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.

-O0

im3w1l 2 hours ago | parent | prev | next [-]

In C, acceptance is "I will write UB and it will eventually lead to something bad happening"

Ygg2 3 hours ago | parent | prev [-]

> -Acceptance: "Just don't write UB."

Just switch to a saner language.

And before I get attacked for being a Rust shill, I meant Java :P

The bar is so low it's floating near the center of the Earth.

dns_snek 2 hours ago | parent | next [-]

> And before I get attacked for being a Rust shill, I meant Java :P

If all you want is C but less insane then the obvious answer here is Zig.

simonask 2 hours ago | parent | next [-]

Zig is cool, but it is not even close to being ready for prime-time. It will be pre-1.0 for a while, and major breaking changes are still happening.

dns_snek 2 hours ago | parent [-]

Sure, maybe don't bet your entire company on mountains of Zig code just yet, but aside from the breaking changes it's been perfectly usable and suitable for every project I've ever wanted to work on.

AgentME an hour ago | parent | prev | next [-]

If someone is switching from C because it's too easy to trigger undefined behavior, picking one of the few other non-memory-safe languages is missing the point.

psychoslave 2 hours ago | parent | prev [-]

If all somebody wants is a saner programming language than C/C++ on these matters, there are plenty of off-the-shelf options to pick from.

If all somebody wants is a turnkey replacement for the C/C++ ecosystem, then there is nothing like that in the world that I’m aware of.

p2detar 3 hours ago | parent | prev | next [-]

> Just switch to a saner language.

And where's the fun in that?

psychoslave 2 hours ago | parent [-]

That’s a matter of taste. Being reminded at every move that what you express depends on some technical detail is great when you love technical details and have all the leisure time to pay attention to them. Compared to sound defaults, it is going to be hell for someone who wants to focus on delivering higher-order features and functionality that will most likely work just fine.

Undefined behaviour means "we couldn’t settle on a best default trade-off with fine-tuning as a given option, so we left everyone in the dark".

xeyownt 3 hours ago | parent | prev | next [-]

[flagged]

ErroneousBosh 2 hours ago | parent | prev [-]

Okay, so Java compiles to machine code now?

Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.

elch 2 hours ago | parent | next [-]

If you don't like JIT/JVM there's GraalVM Native Image.

https://www.graalvm.org/latest/reference-manual/native-image...

In the past you could use e.g. Excelsior JET.

pjc50 2 hours ago | parent | prev [-]

Java has been jitted for .. decades?

Hendrikto 40 minutes ago | parent [-]

You know what JIT means, right? It means that it is not compiled from the start and indeed runs on a bytecode interpreter until the JIT compiler kicks in.

fc417fc802 19 minutes ago | parent [-]

The Java JIT has produced sufficiently fast code for all but the most demanding of HPC applications for going on 20 years. I realize keeping up with new developments can be difficult, but the out-of-date Java performance memes are entirely ridiculous by now.

Meanwhile half the world appears to run on CPython of all things.