Remix.run Logo
duskwuff 4 days ago

No. The problems involved are fundamental:

1) Load/store latency is unpredictable - whenever you get a cache miss (which is unpredictable*), you have to wait for the value to come back from main memory (which is getting longer and longer as CPUs get faster and memory latency roughly stays the same). Statically scheduling around this sort of unpredictable latency is extremely difficult; you're better off doing it on the fly.

2) Modern algorithms for branch prediction and speculative execution are dynamic. They can make observations like "this branch has been taken 15/16 of the last times we've hit it, we'll predict it's taken the next time" which are potentially workload-dependent. Compile-time optimization can't do that.

*: if you could reliably predict when the cache would miss, you'd use that to make a better cache replacement algorithm

brucehoult 4 days ago | parent [-]

> this branch has been taken 15/16 of the last times we've hit it

That is kind of how it worked more than 30 years ago (pre 1995), but not since, at least in OoO CPUs.

In fact it was found that having more than a 2-bit saturating counter doesn't help, because when the situation changes it takes too many bad predictions in a row to get to predictions that, actually, this branch is not being taken any more.

What both the Pentium Pro and PowerPC 604 (the first OoO designs in each family) had was a global history of how you GOT TO the current branch. The Pentium Pro had 4 bits of taken/not taken history for the last four conditional branches and this was used to decide which 2-bit counter to use for a given branch instruction. The PowerPC 604 used 6 bits of history. The Pentium Pro algorithm for combing the branch address with the history (XOR them!) is called "gshare". The PPC604 did something a little bit different but I'm not sure what. By the PPC750 Motorola was using basically the same gshare algorithm as Intel.

There are newer and better algorithms today -- exactly what is somewhat secret in leading edge CPUs -- but gshare is simple and is common in low end in-order and small OoO CPUs to this day. The Berkeley BOOM core uses a 13 bit branch history. I think early SiFive in-order cores such as the E31 and U54 used 10 bits.

duskwuff 4 days ago | parent [-]

Fair point, I oversimplified a bit. Either way, what matter is that it's dynamic.