jimmaswell 4 days ago

I wonder if we could make an LLM or other modern machine learning framework finally figure out how to compile to Itanic in an optimized fashion.

duskwuff 4 days ago

No. The problems involved are fundamental:

1) Load/store latency is unpredictable - whenever you get a cache miss (which is itself unpredictable*), you have to wait for the value to come back from main memory, and that wait keeps getting longer as CPUs get faster while memory latency stays roughly the same. Statically scheduling around this kind of unpredictable latency is extremely difficult; you're better off doing it on the fly (first sketch below).

2) Modern branch prediction and speculative execution algorithms are dynamic. They can make observations like "this branch has been taken 15/16 of the last times we've hit it, so we'll predict it's taken the next time", which are potentially workload-dependent. Compile-time optimization can't do that (second sketch below).

*: if you could reliably predict when the cache would miss, you'd use that to make a better cache replacement algorithm
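
To make (1) concrete, here's a hypothetical C fragment (mine, not from anyone's codebase; the cycle counts are ballpark assumptions): in a pointer-chasing loop every load either hits cache or misses to DRAM, and the compiler can't know which at build time.

    /* Each iteration serializes on the load of n->next. That load
       may be a ~4-cycle L1 hit or a ~300-cycle DRAM miss (ballpark
       figures) -- unknowable statically. An OoO core overlaps the
       stall with other in-flight work on the fly; a static VLIW
       schedule has to bake in one fixed latency. */
    struct node { struct node *next; int value; };

    int sum_list(const struct node *n) {
        int sum = 0;
        while (n) {
            sum += n->value;
            n = n->next;
        }
        return sum;
    }

And for (2), a minimal sketch of the classic 2-bit saturating counter doing that "mostly taken" bookkeeping at run time (again illustrative C of my own, not any real core's logic):

    /* Per-branch 2-bit saturating counter:
       0 = strongly not-taken, 1 = weakly not-taken,
       2 = weakly taken,       3 = strongly taken.
       Predict taken when the counter is >= 2; nudge it toward the
       observed outcome once the branch resolves. Saturation gives
       hysteresis: one surprise only demotes "strongly" to "weakly"
       and doesn't flip the prediction. */
    typedef unsigned char ctr2_t;

    static int predict_taken(ctr2_t c) { return c >= 2; }

    static ctr2_t train(ctr2_t c, int taken) {
        if (taken) return c < 3 ? c + 1 : 3;   /* saturate at 3 */
        return c > 0 ? c - 1 : 0;              /* saturate at 0 */
    }

The point is that this state only exists because the program is running; there's nothing for a compiler to look at.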

brucehoult 4 days ago

> this branch has been taken 15/16 of the last times we've hit it

That is kind of how it worked more than 30 years ago (pre 1995), but not since, at least in OoO CPUs.

In fact it was found that going beyond a 2-bit saturating counter doesn't help, because when the branch's behavior changes, a wider counter takes too many mispredictions in a row before it flips over to predicting that, actually, this branch isn't being taken any more.

What both the Pentium Pro and PowerPC 604 (the first OoO designs in each family) had was a global history of how you GOT TO the current branch. The Pentium Pro kept 4 bits of taken/not-taken history for the last four conditional branches and used that to decide which 2-bit counter to consult for a given branch instruction. The PowerPC 604 used 6 bits of history. The Pentium Pro algorithm for combining the branch address with the history (XOR them!) is called "gshare". The PPC604 did something a little bit different, but I'm not sure what. By the PPC750 Motorola was using basically the same gshare algorithm as Intel.

There are newer and better algorithms today -- exactly what they are is somewhat secret in leading-edge CPUs -- but gshare is simple and is common in low-end in-order and small OoO CPUs to this day. The Berkeley BOOM core uses a 13-bit branch history. I think early SiFive in-order cores such as the E31 and U54 used 10 bits.
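
In case it's useful, here's a minimal gshare sketch (hypothetical C with made-up sizes; HISTORY_BITS = 4 mirrors the Pentium Pro figure above, and real designs differ in plenty of details): a global shift register of recent taken/not-taken outcomes is XORed with the branch address to pick a 2-bit saturating counter.

    /* gshare: global branch history XOR branch PC selects one of
       2^HISTORY_BITS two-bit counters. Bump HISTORY_BITS to 13
       for a BOOM-class history length. */
    #define HISTORY_BITS 4
    #define TABLE_SIZE   (1u << HISTORY_BITS)

    static unsigned      ghr;               /* global history register */
    static unsigned char pht[TABLE_SIZE];   /* 2-bit counters, start at 0 */

    static unsigned gshare_index(unsigned pc) {
        return (pc ^ ghr) & (TABLE_SIZE - 1);   /* the "XOR them!" step */
    }

    static int gshare_predict(unsigned pc) {
        return pht[gshare_index(pc)] >= 2;      /* weakly/strongly taken */
    }

    static void gshare_update(unsigned pc, int taken) {
        unsigned i = gshare_index(pc);
        unsigned char c = pht[i];
        pht[i] = taken ? (c < 3 ? c + 1 : 3)    /* saturating increment */
                       : (c > 0 ? c - 1 : 0);   /* saturating decrement */
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (TABLE_SIZE - 1);
    }

The history is what lets correlated branches (ones whose outcome depends on the path taken to reach them) get their own counters instead of fighting over a single per-address one.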

duskwuff 4 days ago

Fair point, I oversimplified a bit. Either way, what matters is that it's dynamic.

o11c 4 days ago

No, VLIW is fundamentally a flawed idea; OoO is mandatory. "We need better compilers" is purely Intel marketing apologia.

whaleofatw2022 4 days ago

Isn't VLIW how a number of GPUs worked internally? That said, a GPU isn't the same as a general-purpose CPU.

buildbot 4 days ago

Yes, as others noted, AMD used VLIW for TeraScale in the Radeon HD 2000-6000 series. https://en.wikipedia.org/wiki/TeraScale_(microarchitecture)

They're used in a lot of DSP chips too, where you (hopefully) have very simple branching, if any, and nice, regular data access patterns.
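
For a sense of why, this is the kind of kernel a DSP runs all day (generic C, not any particular DSP's intrinsics): fixed trip counts, no data-dependent branches, purely sequential reads, so the compiler can software-pipeline it and fill the VLIW issue slots statically.

    /* FIR filter: loop bounds and access patterns are fully known
       at compile time, so a VLIW compiler can schedule the loads,
       multiplies, and adds into parallel issue slots without
       guessing about cache misses or branch outcomes.
       x must hold n + taps - 1 samples. */
    void fir(const float *x, const float *h, float *y, int n, int taps) {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int t = 0; t < taps; t++)
                acc += h[t] * x[i + t];
            y[i] = acc;
        }
    }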

Sesse__ 4 days ago

And typically just fast RAM everywhere instead of caches, so you don't have cache misses. (The flip side is that you typically have very _little_ RAM.)

classichasclass 4 days ago

Some older ones, yeah (TeraScale comes to mind), but modern ones are more like RISC with whopping levels of SIMD. It turns out that VLIW was hard for them too.
