whitten 2 days ago

I know branch prediction is essential if you have instruction pipelining in actual CPU hardware.

It is an interesting thought experiment to consider instruction pipelining in a virtual machine or interpreter design, though. What would you change in a design to allow it? Would an asynchronous architecture be necessary? How would you merge control flow together efficiently to take advantage of it?

addaon 2 days ago | parent | next [-]

> I know branch prediction is essential if you have instruction pipelining in actual CPU hardware.

Only with sufficiently slow memory relative to the pipeline speed. A microcontroller executing out of TCM (tightly coupled memory) doesn't gain anything from prediction, since instruction fetches can keep up with the pipeline.

immibis 2 days ago | parent [-]

The head of the pipeline is at least several clock cycles ahead of the tail, by definition. By the time the branch instruction reaches the part of the CPU where it decides whether to branch or not, the next several instructions have already been fetched, decoded, and partially executed, and that work is thrown away on a mispredicted branch.

There may not be a large delay when executing from TCM with a short pipeline, but it's still there. It can be so small that it doesn't justify the expense of a branch predictor. Many microcontrollers are optimized for power consumption, which means simplicity. I expect microcontroller-class chips to largely run in-order with short pipelines and low-ish clock speeds, although there are exceptions. Older generations of microcontrollers (PIC/AVR) were barely pipelined at all.

addaon 2 days ago | parent [-]

> but it's still there

Unless you evaluate branches in the second stage of the pipeline and forward them. Or add a delay slot and forward them from the third stage. In the typical case you’re of course correct, but there are many approaches out there.

o11c 2 days ago | parent | prev | next [-]

You have to ensure that virtual instructions map to distinct hardware branch instructions, so the indirect-branch predictor can learn each opcode's behavior separately.

Computed-goto-after-each-instruction is well known, and copying fragments of machine code is obvious.
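
For a concrete picture, here is a minimal sketch of computed-goto dispatch in C, using the GCC/Clang labels-as-values extension. The opcode set, run(), and the little program in main() are made up purely for illustration; the point is that each handler ends in its own indirect jump, so the indirect-branch predictor sees a separate, more predictable site per opcode instead of the single shared branch of a switch-in-a-loop interpreter.

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        /* One label per opcode; every handler ends in its own indirect jump. */
        static void *dispatch[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
        int stack[64], *sp = stack;
        const int *ip = code;

    #define NEXT() goto *dispatch[*ip++]
        NEXT();
    op_push:  *sp++ = *ip++;            NEXT();   /* push immediate operand */
    op_add:   sp[-2] += sp[-1]; --sp;   NEXT();   /* pop two, push sum */
    op_print: printf("%d\n", sp[-1]);   NEXT();   /* print top of stack */
    op_halt:  return;
    #undef NEXT
    }

    int main(void) {
        const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);  /* prints 5 */
        return 0;
    }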

Less well known is "make an entire copy of your interpreter for each state" - though I'm only aware of this as a depessimization for stack machines. (Rough sketch of the idea after the link.)

https://dl.acm.org/doi/pdf/10.1145/223428.207165
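
And a rough sketch of the copy-per-state idea applied to a stack machine, with the same made-up opcode set as above: state 0 keeps nothing in registers, state 1 keeps the top of stack in the local `tos`, and every handler dispatches through the table of the state it leaves behind, so the "is the top of stack cached?" check disappears into control flow. Real versions cache more than one item; this is just the two-state shape of it.

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        /* One dispatch table per cache state. */
        static void *state0[] = { &&push0, &&add0, &&print0, &&halt };  /* nothing cached */
        static void *state1[] = { &&push1, &&add1, &&print1, &&halt };  /* TOS cached in tos */
        int stack[64], *sp = stack;   /* sp points past the last item kept in memory */
        int tos = 0;
        const int *ip = code;

        goto *state0[*ip++];
    push0:  tos = *ip++;                     goto *state1[*ip++];  /* state 0 -> 1 */
    push1:  *sp++ = tos; tos = *ip++;        goto *state1[*ip++];  /* spill old TOS first */
    add0:   tos = sp[-2] + sp[-1]; sp -= 2;  goto *state1[*ip++];  /* result becomes cached TOS */
    add1:   tos += *--sp;                    goto *state1[*ip++];  /* no memory traffic for TOS */
    print0: printf("%d\n", *--sp);           goto *state0[*ip++];  /* pop and print */
    print1: printf("%d\n", tos);             goto *state0[*ip++];  /* TOS consumed: 1 -> 0 */
    halt:   return;
    }

    int main(void) {
        const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);  /* prints 5 */
        return 0;
    }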

But the main problem, which none of these solve, is that most VM languages are designed to be impossible (or very difficult) to optimize, due to aggressive use of dynamic typing. Nothing will save you from dynamic types.
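
To make that tax concrete, here is roughly what a single dynamically typed ADD has to do. The Value representation and vm_add below are hypothetical, but the shape is typical: a tag check before any arithmetic can happen, plus a slow path that can do nearly anything (coercion, allocation, user-defined operators), which is what ties the optimizer's hands.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { TAG_INT, TAG_DOUBLE, TAG_OBJECT } Tag;

    typedef struct {
        Tag tag;
        union { int64_t i; double d; void *obj; } as;
    } Value;

    static Value vm_add(Value a, Value b) {
        Value r;
        if (a.tag == TAG_INT && b.tag == TAG_INT) {            /* hoped-for fast path */
            r.tag = TAG_INT;
            r.as.i = a.as.i + b.as.i;                          /* overflow handling omitted */
        } else if (a.tag == TAG_DOUBLE && b.tag == TAG_DOUBLE) {
            r.tag = TAG_DOUBLE;
            r.as.d = a.as.d + b.as.d;
        } else {
            /* Mixed or object operands: coercions, operator dispatch, allocation...
               The mere possibility of reaching here blocks most downstream optimization. */
            r.tag = TAG_OBJECT;
            r.as.obj = NULL;
        }
        return r;
    }

    int main(void) {
        Value a = { TAG_INT, { .i = 2 } }, b = { TAG_INT, { .i = 3 } };
        printf("%lld\n", (long long)vm_add(a, b).as.i);  /* prints 5 */
        return 0;
    }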

cogman10 2 days ago | parent | prev [-]

With the way architectures have gone, I think you'd end up recreating VLIW. The thing holding back VLIW was that compilers were too dumb and computers too slow to really take advantage of it; you ended up with a lot of NOPs in the output as a result. VLIW is essentially how many DSPs (and some GPUs) operate.

The main benefit of VLIW is that it simplifies the processor design by moving the complicated scheduling work out of the hardware and into the compiler. In theory, the compiler has more information about the intent of the program, which lets it optimize better.

It would also be somewhat of a security boon. Right now, branch prediction (and the rollback on a mispredict) lives inside the processor. With exploits like Spectre, pulling that out into the compiler would make it easier to integrate hints on security-sensitive code: "hey, don't speculatively execute here".
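
Per-site hints of that flavor do exist today, just manually placed. As a sketch (the load_checked helper here is hypothetical, but _mm_lfence is a real x86 intrinsic), this is the usual shape of a Spectre-v1 mitigation: fence the bounds check so nothing is speculatively loaded past it. The appeal of compiler-managed speculation is that hints like this could be applied systematically rather than by hand.

    #include <stddef.h>
    #include <emmintrin.h>  /* _mm_lfence (x86/SSE2) */

    /* Classic Spectre-v1 shape: a bounds check the CPU may speculate past. */
    static int load_checked(const int *table, size_t len, size_t untrusted_index) {
        if (untrusted_index < len) {
            _mm_lfence();   /* barrier: no speculative load until the check resolves */
            return table[untrusted_index];
        }
        return 0;
    }

    int main(void) {
        int table[4] = { 10, 20, 30, 40 };
        return load_checked(table, 4, 2) == 30 ? 0 : 1;
    }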

_chris_ 2 days ago | parent [-]

> The thing holding back VLIW was compilers were too dumb

That’s not really the problem.

The real issue is that VLIW requires branches to be strongly biased, statically, so a compiler can exploit them.

But in practice branches are highly dynamic yet trivially predicted by hardware branch predictors, so branch predictors win.

Not to mention that even VLIW cores use branch predictors, because branch resolution latency is too long for the pipeline to simply wait for the outcome.