The Cost of a Closure in C

▲ The Cost of a Closure in C(thephd.dev)

170 points by ingve 13 hours ago | 68 comments

▲ kazinator 4 hours ago | parent | next [-]

> It’s no wonder GCC is trying to add -ftrampoline-impl=heap to the story of GNU Nested Functions; they might be able to tighten up that performance and make it more competitive with Apple Blocks.

[disclaimer] Without brushing up on the details of this, I strongly suspect that this is about removing the need for executable stacks than performance. Allocating a trampoline on the stack rather than heap is good for efficiency.

These days, many GNU/Linux distros are disabling executable stacks by default in their toolchain configuration, both for building the distro and for the toolchain offered by the system to the user.

When you use GCC local functions, it overrides the linker behavior so that the executable is marked for executable stacks.

Of course, that is a security concession because when your stack is executable, that enables malicious remote execution code to work that relies on injecting code into the stack via a buffer overflow and tricking the process into jumping to it.

If trampolines can be allocated in a heap, then you don't need an executable stack. You do need an executable heap, or an executable dedicated heap for these allocations. (Trampolines are all the same size, so they could be packed into an array.)

Programs which indirect upon GCC local functions are not aware of the trampolines. The trampolines are deallocated naturally when the stack rolls back on function return or longjmp, or a C++ exception passing through.

Heap-allocated trampolines have an obvious deallocation problem; it would be interesting to see what strategy is used for that.

	▲	fuhsnn 2 hours ago \| parent [-]
		> Heap-allocated trampolines have an obvious deallocation problem; it would be interesting to see what strategy is used for that. With -ftrampoline-impl=heap, GCC automatically insert[1] pairs of constructor/destructor routines from libgcc which were built around mmap/munmap. [1] https://godbolt.org/z/7s5nooMPz

▲ unwind 8 hours ago | parent | prev | next [-]

This was very interesting, and it's obvious from the majority of the text that the author knows a lot about these languages, their implementation, benchmarking corners, and so on. Really!

Therefore it's very jarring with this text after the first C code example:

This uses a static variable to have it persist between both the compare function calls that qsort makes and the main call which (potentially) changes its value to be 1 instead of 0

This feels completely made up, and/or some confusion about things that I would expect an author of a piece like this to really know.

In reality, in this usage (at the global outermost scope level) `static` has nothing to do with persistence. All it does is make the variable "private" to the translation unit (C parliance, read as "C source code file"). The value will "persist" since the global outermost scope can't go out of scope while the program is running.

It's different when used inside a function, then it makes the value persist between invocations, in practice typically by moving the variable from the stack to the "global data" which is generally heap-allocated as the program loads. Note that C does not mention the existence of a stack for local variables, but of course that is the typical implementation on modern systems.

▲

gldrk 4 hours ago | parent | next [-]

>This uses a static variable to have it persist between both the compare function calls that qsort makes and the main call which (potentially) changes its value to be 1 instead of 0

The only misleading thing here is that ‘static’ is monospaced in the article (this can’t be seen on HN). Other than that, ‘static variable’ can plausibly refer to an object with a static storage duration, which is what the C standard would call it.

>moving the variable from the stack to the "global data" which is generally heap-allocated as the program loads

It is not heap-allocated because you can’t free() it. Non-zero static data is not even anonymously mapped, it is file-backed with copy-on-write.

▲

sfpotter 4 hours ago | parent | prev | next [-]

I had a completely different response reading the sentence. I've been programming in C for 20+ years and am very familiar with exactly the problem the author is discussing. When they referred to a "static variable", I understood immediately that they meant a file static variable private to the translation unit. Didn't feel contrived or made up to me at all; just a reflection of the author's expertise. Precision of language.

▲

CerryuDu 2 hours ago | parent | prev | next [-]

I'm finding myself in a weird position now, because I disagree with a whole lot of things in the blog post (well, the parts I was willing to read anyways), but calling that variable static for the sake of persistence was correct.

The fact that you are questioning the use of the term shows that you are not familiar with the ISO C standard. What the author alludes to is static storage duration. And whether or not you use the "static" keyword in that declaration (also definition), the storage duration of the object remains "static". People mostly call those things "global variables", but the proper standardese is "static storage duration". In that sense, the author was right to use "static" for the lifetime of the object.

EDIT: if you drop "static" from that declaration, what changes is the linkage of the identifier (from internal to external).

▲

pjmlp 5 hours ago | parent | prev | next [-]

The author contributes to ISO C and ISO C++ working groups, and his latest contribution was #embed.

▲

steveklabnik 4 hours ago | parent [-]

Not just that, the author is the Project Editor for WG14.

This doesn’t mean that it’s impossible to make mistakes, but still.

▲

uecker 3 hours ago | parent [-]

It means he can edit LaTeX. Of course, JeanHeyd is very qualified, but being project editor for an ISO standard does not require this.

▲

steveklabnik 3 hours ago | parent [-]

I mean, you're closer to the committee than I am, but while that is true in a literal sense, I'd assume that you all would not let someone who knew how to edit LaTeX but not know anything about C hold that position.

	▲	uecker 13 minutes ago \| parent [-]
		Assuming we have some choice. Not many people volunteer their time to this work, which is quite a lot and not much fun. Companies also do not invest a lot of resources into C.

▲

debugnik 3 hours ago | parent | prev | next [-]

It took me a second read to realise that the mention of static is a red herring. I think the author knows that the linkage is irrelevant for the rest of the explanation; it just happens to be static so they called it static. But by drawing attention to it, it does first read like they're confused about the role of static there.

▲

kreco 8 hours ago | parent | prev [-]

That's a very weird comment, your spreading your knowledge and not really addresse what could have been changed in the article.

If I follow your comment, you mean that he could have use a non-static global variable instead and avoid mentioning "static" keyword afterward?

▲

unwind 7 hours ago | parent [-]

Oh! Thanks, I was not being as concrete as I imagined. Sorry.

Yes, the `static` can simply be dropped, it does no additional work for a single-file snippet like this.

I tried diving into Compiler Explorer to examine this, and it actually produces slightly different code for the with/without `static` cases, but it was confusing to deeply understand quickly enough to use the output here. Sorry.

▲

mananaysiempre 7 hours ago | parent [-]

I see exactly the same assembly from x86-64 GCC 15.2 with -O2 the first example in the article both as is and without `static`, which makes sense. The two do differ if you add -fPIC, as though you’re compiling a dynamic library, and do not add -fvisibility=hidden at the same time, but that’s because Linux dynamic linking is badly designed.

▲

Chabsff 7 hours ago | parent [-]

TU-level concepts (mostly) dissolve during the linking stage. You need to compile with -c to generate an object file in order to see the distinction.

Also, the difference manifests in the symbols table, not the assembly.

	▲	mananaysiempre 6 hours ago \| parent [-]
		To clarify, I was talking about Compiler Explorer-cleaned disassembly, same as the comment I was replying to.

▲ Rochus 10 hours ago | parent | prev | next [-]

The benchmark demonstrates that the modern C++ "Lambda" approach (creating a unique struct with fields for captured variables) is effectively a compile-time calculated static link. Because the compiler sees the entire definition, it can flatten the "link" into direct member access, which is why it wins. The performance penalty the author sees in GCC is partly due to the OS/CPU overhead of managing executable stacks, not just code inefficiency. The author correctly identifies that C is missing a primitive that low-level languages perfected decades ago: the bound method (wide) pointer.

The most striking surprise is the magnitude of the gap between std::function and std::function_ref. It turns out std::function (the owning container) forces a "copy-by-value" semantics deeply into the recursion. In the "Man-or-Boy" test, this apparently causes an exponential explosion of copying the closure state at every recursive step. std::function_ref (the non-owning view) avoids this entirely.

▲

gpderetta 9 hours ago | parent [-]

Even if you never copy the std::function the overhead is very large. GCC (14 at least) does not seem to be able to elide the allocation, nor inline the function itself, even if used immediately after use and the object never escapes the function. Given the opportunity, GCC seems to be able to completely remove one layer pf function_ref, but fails at two layers.

▲

Rochus 9 hours ago | parent | next [-]

This is exactly right, and the "Man-or-Boy" benchmark hits the worst-case scenario for libstdc++ specifically. The optimization fails here. My "copy-by-value" comment refers to the ownership semantics. Since std::function owns its storage, and the Man-or-Boy recursion passes the closure into the next layer (often by value or by capturing it into a new closure), we trigger the copy constructor. If the SBO limit is exceeded, that copy constructor performs a new heap allocation and a deep copy of the state.

▲

boris 9 hours ago | parent | prev [-]

GCC (libstdc++) as all other major C++ runtimes (libc++, MSVC) implements the small object optimization for std::function where a small enough callable is stored directly in std::function's state instead of on the heap. Across these implementations, you can reply on being able to capture two pointers without a dynamic allocation.

	▲	gpderetta 9 hours ago \| parent [-]
		You would think so, but it actually doesn't. last time I checked, libstdc++ could only optimize std::bind closures. A trivial test with a stateless lambda shows this is still the case in GCC14 and 15. In fact I can't even seem to trigger the library optimization with bind. Differently from GCC14, GCC15 itself does seem to be able to optimize the allocation (and the whole std::function) in trivial cases though (independently of what the library does).

▲ uecker 2 hours ago | parent | prev | next [-]

BTW: I wrote why the lambda design does not fit C well here:

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3654.pdf

(and I am not impressed by micro benchmarks)

▲

CerryuDu an hour ago | parent [-]

From the introduction, your paper seems like a counterproposal: support closures, just not the way others propose. But the paper seems to accept that closures / nested functions, supported at the language level directly, are a "good thing" for C specifically. I disagree with that. When and how has it become the consensus?

	▲	uecker 8 minutes ago \| parent [-]
		At the moment there is no consensus for anything related to this. But most people agree that closures are a good thing because they make certain reoccurring programming patterns safer and easier to write. Why do you disagree?

▲ RossBencina 12 hours ago | parent | prev | next [-]

Good to see Borland's __closure extension got a mention.

Something I've been thinking about lately is having a "state" keyword for declaring variables in a "stateful" function. This works just like "static" except instead of having a single global instance of each variable the variables are added to an automatically defined struct, whose type is available using "statetype(foo)" or some other mechanism, then you can invoke foo as with an instance of the state (in C this would be an explicit first parameter also marked with the "state" parameter.) Stateful functions are colored in the sense that if you invoke a nested stateful function its state gets added to the caller's state. This probably won't fly with separate compilation though.

▲

vintagedave 10 hours ago | parent | next [-]

Yes, though it was a remarkably brief mention. I believe Borland tried to standardise it back in 2002 or so,* along with properties. (I was the C++Builder PM, but a decade and a half after that attempt.)

C++Builder’s entire UI system is built around __closure and it is remarkably efficient: effectively, a very neat fat pointer of object instance and method.

[*] Edit: two dates on the paper, but “bound pointer to member” and they note the connection to events too: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2002/n13...

▲

tyushk 4 hours ago | parent | prev | next [-]

Would this be similar to how Rust handles async? The compiler creates a state machine representing every await point and in-scope variables at that point. Resuming the function passes that state machine into another function that matches on the state and continues the async function, returning either another state or a final value.

▲

1f60c 9 hours ago | parent | prev | next [-]

> a "state" keyword for declaring variables in a "stateful" function

Raku (née Perl 6) has this! https://docs.raku.org/language/variables#The_state_declarato...

▲

fuhsnn 10 hours ago | parent | prev | next [-]

I dreamed up a similar idea[1] upon reading the author's closure proposal, it's also really close to async coroutines.

[1] https://github.com/ThePhD/future_cxx/issues/55#issuecomment-...

▲

juvoly 11 hours ago | parent | prev [-]

That sounds cool, but this quickly gets complicated. Some aspects that need to be addressed:

- where does the automatically defined struct live? Data segment might work for static, but doesn't allow dynamic use. Stack will be garbage if closure outlives function context (ie. callback, future). Heap might work, but how do you prevent leaks without C++/Rust RAII?

- while a function pointer may be copied or moved, the state area probably cannot. It may contain pointers to stack object or point into itself (think Rust's pinning)

- you already mention recursion, compilation

- ...

	▲	fuhsnn 10 hours ago \| parent [-]
		IMO the C way is to allow users to explicitly manage context area, along the lines of posix ucontext.h or how the author's closure proposal handle closure allocation[1]. [1] https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Functio...

▲ sirwhinesalot 10 hours ago | parent | prev | next [-]

I think local functions (like the GNU extension) that behave like C++ byref(&) capturing lambdas makes the most sense for C.

You can call the local functions directly and get the benefits of the specialized code.

There's no way to spell out this function's type, and no way to store it anywhere. This is true of regular functions too!

To pass it around you need to use the type-erased "fat pointer" version.

I don't see how anything else makes sense for C.

▲

gpderetta 9 hours ago | parent | next [-]

> There's no way to spell out this function's type, and no way to store it anywhere. This is true of regular functions too!

well regular functions decay to function pointers. You could have the moral equivalent of std::function_ref (or similarly, borland __closure) in C of course and have closures decay to it.

▲

nutjob2 9 hours ago | parent | prev [-]

The price you pay for GCC nested (local) functions is an executable stack with 'trampolines'.

I'm a fan of nested functions but don't think the executable stack hack is worth it, and using a 'display' is a better solution.

See the Dragon Book or Compiler Construction: Principles and Practice (1984) by Louden

▲

sirwhinesalot 9 hours ago | parent [-]

You misunderstood my comment. GNU local function syntax, C++ [&] lambda behavior (i.e., a hidden struct).

▲

nutjob2 8 hours ago | parent [-]

I really did, my comment is specific to C.

	▲	LegionMammal978 5 hours ago \| parent [-]
		The only reason that GCC needs executable trampolines is for the program to be able to create an ordinary function pointer and have all the captured data come along with it. The proposal is to reuse the syntax of nested functions, but change the semantics so that they are no longer callable via ordinary function pointers, but rather "fat pointers" that reference the captured data alongside the raw function address. This is similar to the method used by C++ and does not need trampolines.

▲ Progge 11 hours ago | parent | prev | next [-]

Long time ago I wrote C. Could anyone fill me in why the first code snippet is arg parsing the way it is?

int main(int argc, char* argv[]) {

  if (argc > 1) {

    char\* r_loc = strchr(argv[1], 'r');

    if (r_loc != NULL) {

      ptrdiff_t r_from_start = (r_loc - argv[1]);

      if (r_from_start == 1 && argv[1][0] == '-' && strlen(r_loc) == 1) {
        in_reverse = 1;
      } 

    }

  }

  ...

}

Why not

if (argc > 1 && strcmp(argv[1], "-r") == 0) {

    in_reverse = 1;

}

for example?

▲ tapete2 11 hours ago | parent | next [-]

It doesn't even make sense to use strchr for determining the position of 'r', when the code checks that the position of '-' is at index 0.

Your solution is perfectly fine. Even if you don't have access to strchr for some reason, the original snippet is really convoluted.

You could just write (strlen(argv[1]) > 1 && argv[1][0] == '-' && argv[1][0] == 'r') if you really want to.

	▲	microtherion 10 hours ago \| parent \| next [-]
		It could make some sense to use strchr, because in idiomatic UNIX tools, single character command line options can be clustered. But that also means that subsequent code should not be tested for a specific position. And if you ever find yourself actually doing command line parsing, use getopt(). It handles all the corner cases reliably, and consistent with other tools.
	▲	unwind 8 hours ago \| parent \| prev \| next [-]
		Of course, `&&` in C is short-circuiting so it's safe without the `strlen()` too, as long as the argument is there i.e. not NULL. Also, the use of a convoluted `if` to conditionally assign a literal boolean is a code smell (to me), I would drop the `if` and just use: `in_reverse = argc > 0 && argv[1][0] == '-' && argv[1][1] == 'r';` if a more forward-thinking/strict check is not needed.
	▲	eska 5 hours ago \| parent \| prev [-]
		Your code actually has 2 bugs. The first I assume is just a typo and you meant to use [1][1] == ‘r’. The second one is that you would accept “-rblah” as well.

▲ CerryuDu 2 hours ago | parent | prev | next [-]

Not to mention the potential signed integer overflow in (*right - *left) and (*left - *right), which is undefined behavior. And even if you rely on common two's complement wraparound, the result may be wrong; for example, (INT_MAX-(-1)) should mathematically yield a positive value, but the function will produce INT_MIN, which is negative.

And then we have this "modern" way of spelling pointers, "const int* right" (note the space). In C, declaration syntax mirrors use, so it should be "const int *right", because "*right" is a "const int".

I feel too old for this shit. :(

▲ Joker_vD 8 hours ago | parent | prev [-]

I suspect it was adopted from a bigger snippet that had support for parsing things like "-abc" as "-a -b -c", etc.

▲ hyperbolablabla 5 hours ago | parent | prev | next [-]

Stewart Lynch in his 10x VODs mentions his custom Function abstraction in C++. It's super clean and explicit, avoiding `auto` requirement of C++ lambdas. It's use looks something akin to:

    // imagine my_function takes 3 ints, the first 2 args are captured and curried.
    Function<void(int)> my_closure(&my_function, 1, 2);
    my_closure(3);

I've never implemented it myself, as I don't use C++ features all too much, but as a pet project I'd like to someday. I wonder how something like that compares!

	▲	spacechild1 4 hours ago \| parent [-]
		Isn't this basically the same as passing the function to std::bind_front and storing it in a std::function or std::function_ref?

▲ kazinator 3 hours ago | parent | prev | next [-]

Defininig a callback interface in C without a user context parameter is a capital crime.

	▲	CerryuDu 2 hours ago \| parent [-]
		That's all there is to it. I don't understand the whole obsession with closures. I've used lambdas extensively in modern C++. I hate them with a passion. I've also used OCaml. An awesome language where this stuff is super natural and beautiful. I don't understand why people want to shoehorn functional programming into C. C++ was terrible already, and is now worse for it. > we’re going to be focusing on and looking specifically at Closures in C and C++, since this is going to be about trying to work with and – eventually – standardize something for ISO C that works for everyone. Sigh. My heart sinks.

▲ nesarkvechnep 10 hours ago | parent | prev | next [-]

I'm thinking of using C++ for a personal project specifically for the lambdas and RAII.

I have a case where I need to create a static templated lambda to be passed to C as a pointer. Such thing is impossible in Rust, which I considered at first.

▲ pornel 8 hours ago | parent | next [-]

Yeah, Rust closures that capture data are fat pointers { fn*, data* }, so you need an awkward dance to make them thin pointers for C.

    let mut state = 1;
    let mut fat_closure = || state += 1;
    let (fnptr, userdata) = make_trampoline(&mut &mut fat_closure);

    unsafe {
        fnptr(userdata);
    }

    assert_eq!(state, 2);

    use std::ffi::c_void;
    fn make_trampoline<C: FnMut()>(closure: &mut &mut C) -> (unsafe fn(*mut c_void), *mut c_void) {
        let fnptr = |userdata: *mut c_void| {
            let closure: *mut &mut C = userdata.cast();
            (unsafe { &mut *closure })()
        };
        (fnptr, closure as *mut _ as *mut c_void)
    }

It requires a userdata arg for the C function, since there's no allocation or executable-stack magic to give a unique function pointer to each data instance. OTOH it's zero-cost. The generic make_trampoline inlines code of the closure, so there's no extra indirection.

	▲	skavi 4 hours ago \| parent \| next [-]
		> Rust closures that capture data are fat pointers { fn, data } This isn’t fully accurate. In your example, `&mut C` actually has the same layout as usize. It’s not a fat pointer. `C` is a concrete type and essentially just an anonymous struct with FnMut implemented for it. You’re probably thinking of `&mut dyn FnMut` which is a fat pointer that pairs a pointer to the data with a pointer to a VTable. So in your specific example, the double indirection is unnecessary. The following passes miri: https://play.rust-lang.org/?version=nightly&mode=debug&editi... (did this on mobile, so please excuse any messiness).
	▲	nesarkvechnep 7 hours ago \| parent \| prev [-]
		I know about this technique but it uses too much unsafe for my taste. Not that it's bad or anything, just a personal preference.

▲ queuebert 4 hours ago | parent | prev [-]

In Rust, could you instead use a templated struct wrapping a function pointer along with #[repr(C)]?

▲ groundzeros2015 6 hours ago | parent | prev | next [-]

Thread locals do solve the problem. You create a wrapper around the original function. You set a global thread local user data, you pass in a function which calls the function pointer accepting the user data with the global one.

▲

srcreigh 4 hours ago | parent | next [-]

Yep. Thread locals are probably faster than the other solutions shown too.

It’s confusing to me that thread locals are “not the best idea outside small snippets” meanwhile the top solution is templating on recursion depth with a constexpr limit of 11.

	▲	groundzeros2015 an hour ago \| parent [-]
		The method of having static variables to store state in functions is used heavily in ANSI C book. It’s honestly a beautiful technique when used prudently.

▲

gpderetta 3 hours ago | parent | prev [-]

reentrancy.

	▲	groundzeros2015 an hour ago \| parent [-]
		It doesn’t store state for later. It’s literally impossible to tell it’s happening.

▲ mgaunard 11 hours ago | parent | prev | next [-]

I feel the results say more about the testing methodology and inlining settings than anything else.

Practically speaking all lambda options except for the one involving allocation (why would you even do that) are equivalent modulo inlining.

In particular, the caveat with the type erasure/helper variants is precisely that it prevents inlining, but given everything is in the same translation unit and isn't runtime-driven, it's still possible for the compiler to devirtualize.

I think it would be more interesting to make measurements when controlling explicitly whether inlining happens or the function type can be deduced statically.

▲

gpderetta 10 hours ago | parent [-]

Given a Sufficiently Good™ compiler, yes, after devirtualization and heap elision all variants should generate exactly the same code. In practice is more complicated. Devirtualization needs to runs after (potentially interprocedural) constant propagation, which might be too late to take advantage of other optimization opportunities, unless the compiler keeps rerunning the optimization pipeline.

In a simple test I see that GCC14 has no problems completely removing the overhead of std::function_ref, but plain std::function is a huge mess.

Eventually we will get there [1], but in the meantime I prefer not to rely on devirtualization, and heap elision is more of a party trick.

edit: to compare early vs late inlining: while gcc 14 can remove one layer of function_ref, it seems that it cannot remove two layers, as apparently doesn't rerun the required passes to take advantage of the new opportunity. It has no problem of course removing an arbitrary large (but finite) layers of plain lambdas.

edit2: GCC15 can remove trivial uses of std::function, but this is very fragile. It still can't remove two function_ref.

[1] for example 25 years ago compilers were terrible at removing abstraction overhead of the STL, today there is very little cost.

	▲	mgaunard 3 hours ago \| parent [-]
		You can just write the benchmark in such a way that the optimizations are not possible.

▲ ddtaylor 12 hours ago | parent | prev | next [-]

I actually enjoy trampoline functions in C a bit and it's one of the GNU extensions I use sometimes.

▲ keymasta 8 hours ago | parent | prev | next [-]

It's a post about Man or Boy... and the only typo is... the word _son_. Pretty sure it's supposed to be "on"

▲ psyclobe 10 hours ago | parent | prev | next [-]

c++ for the win!! finally!!

▲ capestart 11 hours ago | parent | prev | next [-]

The breakdown of lambda, blocks, and nested functions demonstrates how important implementation and ABI details are in addition to syntax. I think the standard for C should include a straightforward, first class wide function pointer along with a closure story to stop people from adding these half portable, half spooky extensions.

	▲	uecker 3 hours ago \| parent [-]
		This.

▲ trgn 6 hours ago | parent | prev [-]

i wish JS gurus understood this before jumping all in on hooks and bloating the runtime footprint of every web app out there