> I know branch prediction is essential if you have instruction pipelining in actual CPU hardware.

With sufficiently slow memory, relative to the pipeline speed. A microcontroller executing out of TCM doesn’t gain anything from prediction, since instruction fetches can keep up with the pipeline.

▲

immibis 2 days ago | parent [-]

The head of the pipeline is at least several clock cycles ahead of the tail, by definition. At the time the branch instruction reaches the part of the CPU where it decides whether to branch or not, the next several instructions have already been fetched, decoded and partially executed, and that's thrown away on a mispredicted branch.

There may not be a large delay when executing from TCM with a short pipeline, but it's still there. It can be so small that it doesn't justify the expense of a branch predictor. Many microcontrollers are optimized for power consumption, which means simplicity. I expect microcontroller-class chips to largely run in-order with short pipelines and low-ish clock speeds, although there are exceptions. Older generations of microcontrollers (PIC/AVR) weren't even pipelined at all.

	▲	addaon 2 days ago \| parent [-]
		> but it's still there Unless you evaluate branches in the second stage of the pipeline and forward them. Or add a delay slot and forward them from the third stage. In the typical case you’re of course correct, but there are many approaches out there.