Computed goto for efficient dispatch tables (2012)

variadix 2 hours ago | parent | next [-]

You can probably eliminate the switch bounds check by making the default case execute __builtin_unreachable(). The switch version is safe for non-conforming opcodes where the computed goto version invokes UB.

	▲	fweimer an hour ago \| parent [-]
		Using __builtin_unreachable in this way makes the switch version unsafe, too.

▲

jdw64 3 hours ago | parent | prev | next [-]

I understood up to the comparison operations in this article, but I'm having trouble understanding branch prediction even after reading it. It seems like they used branch prediction optimization... Sometimes when I read articles like this, I start to question whether I can really call myself a programmer

▲

MobiusHorizons an hour ago | parent | next [-]

In short modern CPUs can’t actually execute control flow very fast, so they usually cheat and memorize where the program is likely to go next. This works great with things like the loop branch, since most loops are pretty long the cpu can start on the next iteration of the loop while it waits for the condition to finish being calculated. You incur one branch misprediction at the end, but the rest of the loop is fast. But with a bytecode vm, the action of the code is entirely dependent on the bytecode instructions, so it’s not predictable unless those instructions themselves are predictable. That’s why the switch statement is slow. Explaining the optimization from computed gotos is more complicated.

As I understand it, the branch predictor stores its predictions using the address of the branch instruction as a key. In the switch implementation the unpredictable branch is the one that jumps to the specific case. The compiler emits an indirect branch for this just like the computed goto instruction. But it only emits one indirect branch instruction, so that branch is always unpredictable. In the computed goto version, there is no loop, each case statement ends in its own indirect branch, so each bytecode implementation is followed by a branch with a unique address. In practice this makes each of them slightly more predictable, because you only have to predict what comes after an inc instruction instead of what comes next after any instruction. Basically there are more branches to hang predictions on.

▲

adrian_b 3 hours ago | parent | prev [-]

In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

When another load instruction is executed later, the load address is used to query the cache memory and if the query is successful the data value is returned by the query and it is used instead of accessing again the main memory.

Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.

When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong then all the work done is discarded and execution continues at the right address.

This description is simplified, because modern branch descriptors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.

The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.

This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.

In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.

Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.

	▲	monocasa an hour ago \| parent \| next [-]
		In practice, branch target predictors are also used for conditional branches too, and even unconditional, direct jumps. Yes, the target can (sort of by definition) be computed from the instruction and the pc, but the by using the branch target buffer, you avoid pipeline bubbles in the correctly predicted case by only having a data dependency on the pc, and predicting the target while the instruction is still being fetched. That way on the next cycle you're already fetching from the predicted target while still decoding the branch instruction itself in the next stage(s).
	▲	fpoling 2 hours ago \| parent \| prev \| next [-]
		I wonder are there any compilers that recognizes switch-in-loop pattern of interprets and optimizes it to be branch-predictor friendly using multiple indirect jumps?
	▲	jdw64 2 hours ago \| parent \| prev [-]
		You really explain things in a way that's very easy to understand. I think I finally get it now. I understood the conditional branch part, but I think I was having trouble with the indirect branch predictor. So basically, when a branch instruction is executed, the CPU learns the actual target address. In the case of a switch statement, because the key is a single value like s, the history records get mixed up, whereas with a computed goto, since there are multiple keys, they don't get mixed up. Thank you for the clear explanation. It's a bit embarrassing that I didn't know this, but thanks to people like you, I feel like I'm learning more.

▲

nly 4 hours ago | parent | prev | next [-]

All well and good provided your opcodes are sequential/dense

▲

adrian_b 4 hours ago | parent [-]

The same is true for a "switch" statement.

If the "case" values used in a "switch" are sparse, the "switch" will not be compiled into an indexed jump instruction, but into a sequence of conditional jump instructions, which test all the possible cases.

In this situation, the alternative to "switch" for implementing dispatching is no longer a computed "goto", but a multiple "if"/"else" sequence.

A smarter compiler could detect when a "switch" forms the body of a loop and it would replicate the indexed jump instruction at the end of each case, instead of jumping to the beginning of the loop to execute there an indexed jump, or even worse, first jumping to the end of the loop to terminate the "switch", then jumping to the beginning of the loop to repeat the loop body.

With such a compiler, computed "goto" would not be necessary as an alternative to "switch".

The range check inside the dispatch loop would not be necessary if the opcodes had an enumeration type (in a programming language where enumeration types are clearly distinct from integers) and the "switch" handled all the possible cases. In that situation, the range checks would be moved elsewhere in the program, wherever opcodes are generated.

	▲	wahern an hour ago \| parent [-]
		The GCC documentation example for computed goto uses an indexing table, but you can of course use computed goto addresses directly. The problem is smuggling the label addresses outside the function so you don't need the enum/index indirection during dispatch. I once experimented with some techniques and found you can use constructors, either on function-scoped static variables in C++ or using the constructor function attribute on a nested function in GCC C. The addresses are visible to the constructor function and you can smuggle them by copying through a file- or global- scoped array, then at runtime initialize your data structures and bytecode arrays with the computed goto addresses. That preserves the CPUs branch prediction and prefetching resources for other work. You can also lazily initialize things, which works well if you're not precompiling bytecode but implementing something like a coroutine or state machine where you just need to initialize the first dispatch address. Even on modern chips avoiding computed goto indexing can be meaningful (5+%) depending on use case, but the marginal gains from the smuggling shenanigans specifically probably aren't worth the code complexity and portability headaches (clang doesn't support nested functions in C), though I never tried that specific trick with a full-blown bytecode VM. These days with guaranteed tailcall extensions you can just dispatch on function addresses (knowable before dispatch), though you can't directly control inlining optimizations so if you're mostly doing simple work per operation computed gotos might still have better performance. And for small or straight-forward state machines code can be easier to grok when not split across many functions, especially if they're tailcalling each other through pointers.

▲

kibwen 3 hours ago | parent | prev | next [-]

> Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

Safety only has a cost in this case because the switch is fundamentally just operating on an integer. With an actual enumerated type (rather than C's primitive "enums as numeric aliases"), which even a basic type system could trivially enforce, there would be no need for this check, because the domain of the value would be guaranteed at compile-time.

	▲	flohofwoe 2 hours ago \| parent [-]
		FWIW it's possible to get rid of the range check via a __builtin_unreachable in the default case (which of course is essentially injected UB): https://www.godbolt.org/z/YTcY3E8hx ...it's actually interesting though that the compiler adds the range check on a 'complete' switch over an enum, because the compiler would complain when a case branch is missing (via a warning). E.g. the frontend is smarter than the backend.

▲

froh 4 hours ago | parent | prev [-]

(2012)