kazinator 4 days ago

Undefined behavior only means that ISO C doesn't give requirements, not that nobody gives requirements. Many useful extensions are instances where undefined behavior is documented by an implementation.

Including a header that is not in the program, and not in ISO C, is undefined behavior. So is calling a function that is not in ISO C and not in the program. (If the function is not anywhere, the program won't link. But if it is somewhere, then ISO C has nothing to say about its behavior.)

Correct, portable POSIX C programs have undefined behavior in ISO C; only if we interpret them via IEEE 1003 are they defined by that document.

If you invent a new platform with a C compiler, you can have it such that #include <windows.h> reformats all the attached storage devices. ISO C allows this because it doesn't specify what happens if #include <windows.h> successfully resolves to a file and includes its contents. Those contents could be anything, including some compile-time instruction to do harm.

Even if a compiler's documentation doesn't grant that a certain instance of undefined behavior is a documented extension, the existence of a de facto extension can be inferred empirically through numerous experiments: compiling test code and reverse engineering the object code.

Moreover, the source code for a compiler may be available; the behavior of something can be inferred from studying the code. The code could change in the next version. But so could the documentation; documentation can take away a documented extension the same way as a compiler code change can take away a de facto extension.

Speaking of object code: if you follow a programming paradigm of verifying the object code, then undefined behavior becomes moot, to an extent. You don't trust the compiler anyway. If the machine code has the behavior which implements the requirements that your project expects of the source code, then the necessary thing has been somehow obtained.

throw-qqqqq 4 days ago | parent | next [-]

> Undefined behavior only means that ISO C doesn't give requirements, not that nobody gives requirements. Many useful extensions are instances where undefined behavior is documented by an implementation.

True, most compilers have sane defaults in many cases for things that are technically undefined (like taking sizeof(void) or doing pointer arithmetic on a pointer type other than char *). But not all of these cases can be saved by sane defaults.

Undefined behavior means the compiler can replace the code with whatever. So if you e.g. compile optimizing for size, the compiler may rip out the offending code, as replacing it with nothing yields the greatest size optimization.

See also John Regehr's collection of UB-Canaries: https://github.com/regehr/ub-canaries

Snippets of software exhibiting undefined behavior, e.g. executing both the true and the false branch of an if-statement, or neither, etc. UB should not be taken lightly IMO...

eru 4 days ago | parent [-]

> [...] undefined behavior, executing e.g. both the true and the false branch of an if-statement or none etc.

Or replacing all you mp3s with a Rick Roll. Technically legal.

(Some old version of GHC had a hilarious bug where it would delete any source code with a compiler error in it. Something like this would technically be legal for most compiler errors a C compiler could spot.)

pjmlp 4 days ago | parent | prev | next [-]

Unfortunately it also means that when the programmer fails to understand what undefined behaviour is present in their code, the compiler is free to take advantage of it for the ultimate performance optimizations, as a means to win compiler benchmarks.

The code change might come in something as innocent as a bug fix to the compiler.

account42 4 days ago | parent [-]

Ah yes, the good old "compiler writers only care about benchmarks and are out to hurt everyone else" nonsense.

I for one am glad that compilers can assume that things that can't happen according to the language do in fact not happen and don't bloat my programs with code to handle them.

adwn 4 days ago | parent | next [-]

> I for one am glad that compilers can assume that things that can't happen according to the language do in fact not happen and don't bloat my programs with code to handle them.

Yes, unthinkable happenstances like addition on fixed-width integers overflowing! According to the language, signed integers can't overflow, so code like the following:

    int new_offset = current_offset + 16;
    if (new_offset < current_offset)
        return -1; // Addition overflowed, something's wrong
can be optimized to the much leaner

    int new_offset = current_offset + 16;
Well, I sure am glad the compiler helpfully reduced the bloat in my program!
account42 4 days ago | parent [-]

Garbage in, garbage out. Stop blaming the compiler for your bad code.

adwn 4 days ago | parent [-]

You're objectively wrong. This code isn't bad, it's concise and fast (even without the compiler pattern-matching it to whatever overflow-detecting machine instructions happen to be available), and it would be valid and idiomatic for unsigned int. Stop blaming the code for your bad language spec.

account42 4 days ago | parent [-]

The language spec isn't bad just because it doesn't allow you to do what you want. Are you also upset that you need to add memory barriers where the memory model of the underlying platform doesn't need them?

Again, this isn't undefined behavior to fuck you over and compilers don't use it for optimizations because they hate you. It's because it makes a real difference for performance which is the primary reason low level languages are used.

If you for some reason want less efficient C++ then compilers even provide you flags to make this particular operation defined. There is no reason the rest of us have to suffer worse performance because of you.

Personally I would prefer if unsigned ints had the same undefined behavior by default with explicit functions for wrapping overflow. That would make developer intent much clearer and give tools a chance to diagnose unwanted overflow.

adwn 3 days ago | parent [-]

No, dangerous behavior should be opt-in, not opt-out. In 99.9 % of integer additions, overflow-is-UB won't make any difference performance-wise, but may still screw you over if you're unlucky. In the 0.1 % of cases where there's even the potential for a measurable speed-up, you'll want to carefully craft your code anyway, and you can use explicit, UB-exploiting functions.

Rust does it right: the "+"-operator wraps in release builds (and panics when debug assertions are enabled), and there are explicit functions for checked/overflowing/saturating/strict/unchecked/wrapping addition [1]. The "unchecked" variant exploits UB-on-overflow and is marked as unsafe. These functions exist for both signed and unsigned integer types. Which once again shows that it's very well possible to design a sane, low-level language which is just as fast as C.

[1] https://doc.rust-lang.org/std/primitive.u32.html

titzer 4 days ago | parent | prev [-]

Moral hazard here. The rest of us, and all of society, now rests on a huge pile of code written by incorrigible misers who imagined themselves able to write perfect, bug-free code that would go infinitely fast because bad things never happen. But see, there's bugs in your code and other people pay the cost.

kazinator 4 days ago | parent | next [-]

There is an incredible amount of C out there relative to how the sky basically isn't falling.

titzer 4 days ago | parent [-]

Ransomware attacks against hospitals and a dark extortion economy churning tens if not hundreds of billions of dollars a year in losses and waste.

What would the "sky falling" look like to you? If you're expecting dramatic movie scenes like something out of Mr Robot, I'm afraid the reality is more mundane, just a never-ending series of basic programming errors that turn into remote code execution exploits because of language and compiler choices by people who don't pay the costs.

kazinator 4 days ago | parent [-]

To completely eliminate the possibility of ransomware attack, you need an incredibly locked down platform, and users who are impervious to social engineering.

Vulnerabilities to ransomware (and other forms of malware) can exist without a single bad pointer being dereferenced.

For instance, a memory-safe e-mail program can automatically open an attachment, and the memory-safe application which handles the attachment can blindly run code embedded in the document in a leaky sandbox.

There is an incredible amount of infrastructure out there that depends on C. Embedded devices, mobile devices, desktops, servers. Network stacks, telephony stacks, storage, you name it. Encryption, codecs, ...

"Sky is falling" would mean all of it failing so badly that, for instance, you would have about a 50% chance of connecting to a server that is more than four hops away.

account42 4 days ago | parent | prev [-]

There's bugs in your code without undefined behavior too. Go use a different language if you don't care about performance, there are many to choose from.

pjmlp 4 days ago | parent [-]

Not only do I care about performance, the languages I use are able to deliver both safety and performance at the level required for project delivery.

Unfortunately too many folks still pretend C is some kind of magic portable Assembly language whose feats no other language on Earth is able to achieve.

Also, if I care enough about ultimate performance, like anyone who actually cares about performance, I dust off my Assembly programming skills, alongside algorithms, data structures and computer organisation.

quietbritishjim 4 days ago | parent | prev | next [-]

> Including a header that is not in the program, and not in ISO C, is undefined behavior.

What is this supposed to mean? I can't think of any interpretation that makes sense.

I think ISO C defines the executable program to be something like the compiled translation units linked together. But header files do not have to have any particular correspondence to translation units. For example, a header might declare functions whose definitions are spread across multiple translation units, or define things that don't need any definitions in particular translation units (e.g. enum or struct definitions). It could even play macro tricks which means it declares or defines different things each time you include it.

Maybe you mean it's undefined behaviour to include a header file that declares functions that are not defined in any translation unit. I'm not sure even that is true, so long as you don't use those functions. It's definitely not true in C++, where it's only a problem (not sure if it's undefined exactly) if you odr-use a function that has been declared but not defined anywhere. (Examples of odr-use are calling the function or taking its address, but not, for example, using sizeof on an expression that includes it.)

kazinator 4 days ago | parent [-]

> I can't think of any interpretation that makes sense

Start with a concrete example. A header that is not in our program, or described in ISO C. How about:

  #include <winkle.h>
Defined behavior or not? How can an implementation respond to this #include while remaining conforming? What are the limits on that response?

> But header files do not have to have any particular correspondence to translation units.

A header inclusion is just a mechanism that brings preprocessor tokens into a translation unit. So, what does the standard tell us about the tokens coming from #include <winkle.h> into whatever translation unit we put it into?

Say we have a single file program and we made that the first line. Without that include, it's a standard-conforming Hello World.

im3w1l 4 days ago | parent | next [-]

I think we are slowly getting closer to the crux of the matter. Are you saying that it's a problem to include files from a library since they are "not in our program"? What does that phrase actually mean? What is the bounds of "our program" anyway? Couldn't it be the set {main.c, winkle.h}

kazinator 4 days ago | parent [-]

> What is the bounds of our program?

N3220: 5.1.1.1 Program Structure

A C program is not required to be translated in its entirety at the same time. The text of the program is kept in units called source files, (or preprocessing files) in this document. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit. Previously translated translation units may be preserved individually or in libraries. The separate translation units of a program communicate by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units may be separately translated and then later linked to produce an executable program.

> Couldn't it be the set {main.c, winkle.h}

No; in this discussion it is important that <winkle.h> is understood not to be part of the program; no such header is among the files presented for translation, linking and execution. Thus, if the implementation doesn't resolve #include <winkle.h> we get the uninteresting situation that a constraint is violated.

Let's focus on the situation where it so happens that #include <winkle.h> does resolve to something in the implementation.

quietbritishjim 4 days ago | parent [-]

The bit of the standard that you've quoted says that the program consists of all files that are compiled into it, including all files that are found by the #include directive. So, if <winkle.h> does successfully resolve to something, then it must be part of the program by definition because that's what "the program" means.

Your question about an include file that isn't part of the program just doesn't make any sense.

(Technically it says that those files together make up the "program text". As my other comment says, "program" is the binary output.)

kazinator 4 days ago | parent [-]

I see what you are getting at. Programs consist of materials that are presented to the implementation, and also of materials that come from the implementation.

So what I mean is that no file matching <winkle.h> has been presented as part of the external file set given to the implementation for processing.

I agree that if such a file is found by the implementation it becomes part of the program, as makes sense and as that word is defined by ISO C, so it is not the right terminology to say that the file is not part of the program, yet may be found.

If the inclusion is successful, though, the content of that portion of that program is not defined by ISO C.

quietbritishjim 4 days ago | parent [-]

It still seems like you have invented some notion of "program" that doesn't really exist. Most suspicious is when you say this:

> So what I mean is that no file matching <winkle.h> has been presented as part of the external file set given to the implementation for processing.

The thing is, there is no "external file set" that includes header files, so this sentence makes no sense.

Note that when the preprocessor is run, the only inputs are the file being preprocessed (i.e., the .c file) and the list of directories to find include files (called the include path). That's not really part of the ISO standard, but it's almost universal in practice. Then the output of the preprocessor is passed to the compiler, and now it's all one flat file so there isn't even a concept of included files at this point. The object files from compilation are then passed to the linker, which again doesn't care about headers (or indeed the top-level source files). There are more details in practice (especially with libraries) but that's the essence.

I wonder if your confusion is based on seeing header files in some sort of project-like structure in an IDE (like Visual Studio). But those are just there for ease of editing - the compiler (/preprocessor) doesn't know or care which header files are in your IDE's project, it only cares about the directories in the include path. The same applies to CMake targets: you can add include files with target_sources(), but that's just to make them show up in any generated IDE projects; it has no effect on compilation.

Or are you just maybe saying that the developer's file system isn't part of the ISO C standard, so this whole textual inclusion process is by some meaning not defined by the standard? If so, I don't think that matches the conventional meaning of undefined behaviour.

If it's neither of those, could you clarify what exactly you mean by "the external file set given to the implementation for processing"?

kazinator 4 days ago | parent [-]

Let's drop the word "program" and use something else, like "project", since the word "program" is normative in ISO C.

The "project" is all the files going into a program supplied other than by the implementation.

C programs can contain #include directives. Those #include directives can be satisfied in one of three ways: they can reference a standard header which is specified by ISO C and hence effectively built into the hosted language, such as <stdio.h>.

C programs can #include a file from the project. For instance someone's "stack.c" includes "stack.h". So yes, there is an external file set (the project) which can have header files.

C programs can also #include something which is neither of the above. That something might be not found (constraint violation). Or it might be found (the implementation provides it). For instance <sys/mman.h>: not in your project, not in ISO C.

My fictitious <winkle.h> falls into this category. (It deliberately doesn't look like a common platform-specific header coming from any well-known implementation---but that doesn't matter to the point).

> Or are you just maybe saying that the developer's file system isn't part of the ISO C standard, so this whole textual inclusion process is by some meaning not defined by the standard?

Of course it isn't; no, I'm not saying that. The C standard gives requirements as to how a program (project parts and otherwise) is processed by the implementation, including all the translation phases, preprocessing among them.

To understand what the requirements are, we must consider the content of the program. We know what the content of the project parts is: that's in our files. We (usually indirectly) know the content of the standard headers, from the standard; we ensure that we have met the rules regarding their correct use and what we may or may not rely on coming from them.

We don't know the content of successfully included headers that don't come from our project or from ISO C; or, rather, we don't know that content just from knowing ISO C and our project. In ISO C, we can't find any requirements as to what is supposed to be there, and we can't find it in our project either.

If we peek into the implementation to see what #include <winkle.h> is doing (and such a peeking is usually possible), we are effectively looking at a document, and then if we infer from that document what the behavior will be, it is a documented extension --- standing in the same place as what ISO C calls undefined behavior. Alternatively, we could look to actual documentation. E.g. POSIX tells us what is in <fcntl.h> without us having to look for the file and analyze the tokens. When we use it we have "POSIX-defined" behavior.

#include <winkle.h> is in the same category of thing as __asm__ __volatile__ or __int128_t or what have you.

#include <winkle.h> could contain the token __wipe_current_directory_at_compile_time which the accompanying compiler understands and executes as soon as it parses the token. Or __make_demons_fly_out_of_nose. :)

Do you see the point? When you include a nonstandard header that is not coming from your project, and the include succeeds, anything can happen. ISO C no longer dictates the requirements as to what the behavior will be. Something unexpected can happen, still at translation time.

Now headers like <windows.h> or <unistd.h> are exactly like <winkle.h>: same undefined behavior.

quietbritishjim 4 days ago | parent [-]

> The "project" is all the files going into a program supplied other than by the implementation.

Most of my most recent comment is addressing the possibility that you meant this.

As I said, there is no such concept to the compiler. It isn't passed any list of files that could be included with #include, only the .c files actually being compiled, and the directories containing includable files.

The fact that your IDE shows project files is an illusion. Any header files shown there are not treated differently by the compiler/preprocessor to any others. They can't be, because it's not told about them!

It's even possible to add header files to your IDE's project that are not in the include path, and then they wouldn't be picked up by #include. That's how irrelevant project files are to #include.

kazinator 4 days ago | parent [-]

There is no "compiler", "IDE" or "include path" in the wording of the ISO C standard. A set of files is somehow presented to the implementation in a way that is not specified. Needless to say, a file that is included like "globals.h" but is not the base file of a translation unit will not be indicated to the implementation as the base of a translation unit. Nevertheless it has to be somehow present, if it is required.

It doesn't seem as if you're engaging with the standard-based point I've been making, in spite of detailed elaboration.

> Any header files shown there are not treated differently by the compiler/preprocessor to any others.

This is absolutely false. Headers which are part of the implementation, such as standard-defined headers like <stdlib.h> need not be implemented as files. When the implementation processes #include <stdlib.h>, it just has to flip an internal switch which makes certain identifiers appear in their respective scopes as required.

For that reason, if an implementation provides <winkle.h>, there need not be such a file anywhere in its installation.

quietbritishjim 3 days ago | parent [-]

I only discussed things like include directories and IDEs, which are not part of the standard, because I am trying in good faith to understand how you could have come to your position. There is nothing in the standard like the "set of files is somehow presented to the implementation" (in a sense that includes header files) so I reasoned that maybe you were thinking of something outside the standard.

Instead, the standard says that the include directive:

> searches a sequence of implementation-defined places for a header ... and causes the replacement of that directive by the entire contents of the header.

(Note that it talks about simply substituting in text, not anything more magical, but that's digressing.)

It's careful to say "places" rather than "directories" to avoid the requirement that there's an actual file system, but the idea is the same. You don't pass the implementation every individual file that might need to be included, you pass in the places that hold them and a way to search them with a name.

Maybe you were confused by that part of the standard you quoted in an earlier comment.

One part of that says "The text of the program is kept in units called source files, (or preprocessing files) in this document." But the "source files" aren't really relevant to the include directive – those are the top-level files being compiled (what you've called "base files").

The next sentence you quoted says "A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit." But "all the headers" here is just referring to files that have been found by the search mechanism referred to above, not some explicit list.

kazinator 3 days ago | parent [-]

My position doesn't revolve around the mechanics of preprocessing. Say we have somehow given the implementation a translation unit which has #include <winkle.h>. Say we did not give the implementation a file winkle.h; we did not place such a file in any of the places where it searches for include files.

OK, now suppose that the implementation resolves #include <winkle.h> and replaces it with tokens.

The subsequent processing is what my position is concerned with.

My position is that since the standard doesn't define what those tokens are, the behavior is not defined.

In other words, a conforming implementation can respond to #include <winkle.h> with any behavior whatsoever.

- It can diagnose it as not being found.

- It can replace it with the token __wipe_current_directory which that same implementation then understands as a compile-time instruction to wipe the current directory.

- Or any other possibility at all.

This all has to do with the fact that the header is not required to exist, but may exist, and if it does exist, it may have any contents whatsoever, including non-portable contents, understood by that implementation, which do weird things.

It is not required to document any of it, but if it does, that constitutes a documented extension.

A conforming implementation can support a program like this:

  #include <pascal.h>

  program HelloWorld;
  begin
    WriteLn('Hello, World!');
  end.
All that <pascal.h> has to do is regurgitate a token like __pascal_mode. This token is processed by translation phase 7, which tells the implementation to start parsing Pascal, as an extension.
quietbritishjim 4 days ago | parent | prev [-]

Do you just mean an attempt to include a file path that couldn't be found? That's not a correct usage of the term "program" – that refers to the binary output of the compilation process, whereas you're talking about the source files that are the input to the compilation. That sounds a bit pedantic but I really didn't understand what you meant.

I just checked, and if you attempt to include a file that cannot be found (in the include path, though it doesn't use that exact term) then that's a constraint violation and the compiler is required to stop compilation and issue a diagnostic. Not undefined behaviour.

kazinator 4 days ago | parent [-]

Yes; we are more interested in the other case: it happens to be found.

What are the requirements then?

quietbritishjim 4 days ago | parent [-]

I don't get your point then. If the file is found then there is no undefined behaviour in the process of the file being included. There might be undefined behaviour in the overall translation unit after the text has been substituted in, but that's nothing to do with the preprocessor.

kazinator 4 days ago | parent [-]

> If the file is found then there is no undefined behaviour in the process of the file being included.

Correct; but processing doesn't stop there.

> There might be undefined behaviour in the overall translation unit

But what does that mean; how do you infer that there might be undefined behavior?

Does ISO C define the behavior, or does it not?

ISO C has nothing to say about what is in #include <winkle.h> if such a header is found and didn't come from the program.

Without having anything to say about what is in it, if it is found at all, ISO C cannot be giving a definition of behavior of the tokens that are substituted for that #include.

gpderetta 4 days ago | parent | prev [-]

You are basically trying to explain the difference between a conforming program and a strictly conforming one.