stingraycharles 2 days ago

> My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.

I always figured the algorithm was much simpler: it would just predict the same outcome as the last execution — should work fairly well.

Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value. This article raises more questions than it answers; I’m intrigued now.

Tuna-Fish 2 days ago | parent | next [-]

It does not and cannot use the input value. The branch predictor runs a dozen or so cycles before execution, so it generally cannot see values in registers. It also runs 3-4 cycles before decode, so it cannot even use any bits from the instruction being executed.⁰ Branch predictors strictly use branch history, but this is more complex than just tracking the probability of a single branch being taken: there are tables that maintain history for all branches over the past tens of thousands of cycles and try to capture common patterns.
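To make "tables that capture common patterns" concrete, here is a minimal sketch of the textbook gshare scheme in Python: a table of 2-bit saturating counters indexed by the branch address XORed with a global history of recent outcomes. The table size, initial counter values, and the "counter >= 2 means taken" rule are illustrative; real predictors layer several such structures.

```python
# Toy gshare-style branch predictor (illustrative sizes, not any real CPU).

TABLE_BITS = 10
table = [1] * (1 << TABLE_BITS)   # 2-bit counters, start weakly not-taken
history = 0                       # global history: one bit per recent branch

def predict(pc):
    """Predict taken/not-taken for the branch at address pc."""
    idx = (pc ^ history) & ((1 << TABLE_BITS) - 1)
    return table[idx] >= 2        # counter in {2, 3} means "predict taken"

def update(pc, taken):
    """After the branch resolves, train the counter and shift the history."""
    global history
    idx = (pc ^ history) & ((1 << TABLE_BITS) - 1)
    # saturate the counter toward the observed outcome
    table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    history = ((history << 1) | taken) & ((1 << TABLE_BITS) - 1)
```

Because the index mixes in the global history, the same branch gets different counters depending on what led up to it, which is how correlated patterns across branches get captured.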

0: This is why the first prediction is always "don't branch", because the first time executing code the predictor has literally no information at all. Every now and then people ask for hint bits on branches, but, er, how are you planning to do that when the instruction with the branch hasn't arrived from L1 when the prediction is due?

nkurz 2 days ago | parent | prev | next [-]

> I always figured the algorithm was much simpler: it would just predict the same outcome as the last execution — should work fairly well.

Sure, that would work significantly better than no predictor at all. But you'd agree that a better predictor would work better, right? The missing detail might be how expensive mispredicted branches are compared to other costs. If you can go from 50% accuracy to 90% accuracy, it wouldn't be surprising to more than double your performance.
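As a back-of-envelope illustration of that claim (the penalty and branch density below are assumptions for the sake of arithmetic, not measurements):

```python
# Rough model: cost per branch = useful work + expected flush penalty.
# Assumed numbers, purely illustrative:
#   - one conditional branch every 5 instructions, 1 instruction/cycle
#   - each misprediction costs a 15-cycle pipeline flush

penalty = 15          # cycles lost per mispredicted branch (assumed)
work = 5              # cycles of useful work between branches (assumed)

def cycles_per_branch(accuracy):
    return work + (1 - accuracy) * penalty

naive = cycles_per_branch(0.50)   # coin-flip predictor: 12.5 cycles
good = cycles_per_branch(0.90)    # decent predictor:     6.5 cycles
print(naive / good)               # ≈ 1.92x speedup
```

With those assumed numbers, going from 50% to 90% accuracy nearly doubles throughput, which matches the intuition above.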

> Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value.

It doesn't, and can't for the reasons you hint at. The reason branch prediction is necessary is that the value often isn't available yet when the branch is taken. Was there something in the article that implied the opposite?

--

I wonder if Daniel's tricksy approach using a random number generator to simulate a complex pattern is misleading people here.

One of the main benefits of branch prediction is predicting the end of a loop, particularly a loop within a loop. In assembly, a loop is just a comparison at the end and a branch back to the beginning. Assume you had a loop that always executes 8 times, or some other small fixed value. Also assume there is some reason you can't unroll that loop, and that loop is inside another loop that executes millions of times. It's a real boost to performance if you can consistently predict the end of the inner loop.

If you predicted based only on what the loop-closing branch did last time, you'd always miss the loop exit. But if you can remember a pattern that is longer than 8, you can always get it right. This is obviously valuable. The bigger question is how much more valuable it is to predict a loop (where "loop" might actually be a complex execution pattern across multiple branches) that is thousands of iterations long rather than just 8. But quantifying how long this pattern can be on different processors is part of the groundwork for analyzing this.
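To make that concrete, here's a toy simulation in Python comparing a "same as last time" predictor with a per-branch history predictor on an inner loop that runs 8 times inside an outer loop of 1000. The history length, default prediction, and table structure are all illustrative choices, not a model of real hardware.

```python
# Toy comparison: 1-bit "same as last time" vs. a local-history predictor
# on the closing branch of an 8-iteration inner loop.

def branch_stream(inner=8, outer=1000):
    """Outcomes of the inner loop's closing branch: taken (True) 7 times,
    then not-taken once, repeated for each outer iteration."""
    for _ in range(outer):
        for i in range(inner):
            yield i < inner - 1  # taken while the loop continues

def last_taken_misses(stream):
    """Predict whatever this branch did last time."""
    pred, misses = True, 0
    for outcome in stream:
        if pred != outcome:
            misses += 1
        pred = outcome
    return misses

def history_misses(stream, bits=8):
    """Predict from the last `bits` outcomes of this branch, learning a
    mapping from history pattern -> most recent outcome.  Unseen
    histories default to "taken" (an illustrative choice)."""
    table, hist, misses = {}, 0, 0
    for outcome in stream:
        if table.get(hist, True) != outcome:
            misses += 1
        table[hist] = outcome
        hist = ((hist << 1) | outcome) & ((1 << bits) - 1)
    return misses

print(last_taken_misses(branch_stream()))  # 1999: ~2 misses per inner loop
print(history_misses(branch_stream()))     # 1: a single warm-up miss
```

The last-taken predictor misses the exit every time (and the re-entry after it), a ~25% miss rate, while an 8-bit history uniquely identifies each position in the period-8 pattern and is perfect once warmed up.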

pyrolistical 2 days ago | parent | prev [-]

Agreed. I wonder if this silicon is designed for this benchmark and, if not, how useful it is with real code.

I would be surprised if this silicon area could not be better utilized for something else.

Tuna-Fish 2 days ago | parent [-]

No, branch predictors are really important and even small improvements in them are extremely valuable on real loads. Improving branch prediction is both a power and a performance optimization.