| ▲ | classichasclass 4 days ago |
| (author) I thought PA-RISC could go a lot further. It had good performance and a fairly sane ISA, and HP put a lot of work into it. Dropping it for Itanic was one of HP's poorest decisions, because compilers just weren't sophisticated enough to make EPIC VLIW efficient, and arguably still aren't. The "one instruction per cycle" limit ended up getting addressed a different way. |
|
| ▲ | ndiddy 4 days ago | parent | next [-] |
| HP, DEC/Compaq, and SGI all made the decision to drop their various bespoke architectures for Itanium years before prototypes were available based solely on what Intel claimed performance would be on paper. Even Sun and IBM made noise about doing the same thing. Honestly, I think it was inevitable that something like this would happen. By the late 90s, it was starting to get too expensive for each individual high-end server/workstation company to continue investing in high-performance chip design and leading edge fabs to make low-volume parts, so it made sense for all of them to standardize on a common architecture. The mistake everyone made was choosing Itanium to be the industry standard. |
| |
| ▲ | classichasclass 4 days ago | parent | next [-] | | Yes, that's all true, but I blame HP more than the others, because a large part of what went into Itanium came from HP. Their simulation results convinced them Itanic would eclipse all other architectures, and they were way off base, junking an architecture that still had room to grow in the process. Even so, PA-RISC was still competitive at least into the days of Mako, though they kind of phoned it in with Shortfin (the last PA-8900). | |
| ▲ | kev009 4 days ago | parent | prev [-] | | IBM did ship a few generations of Itanium hardware, they just smartly never bet the farm on it. MIPS and SPARC were always a little weak versus contemporaries; if SGI had held on a bit longer with the R18k, that would have been enough time to read the tea leaves and jump to Opteron instead. PA-RISC and Alpha had big enough captive markets and some legs left, but got pulled out too soon. That paradoxically might have led to a healthier Sun that went all in on Opteron. |
|
|
| ▲ | jeffbee 4 days ago | parent | prev | next [-] |
| I mean, the performance was better than "fair". Nothing could touch it, including Alpha. They abandoned it right at the top of their game. |
| |
| ▲ | kev009 4 days ago | parent [-] | | The most astonishing thing about this is that it was done with the ISA essentially in forbearance: the PA-RISC ISA was basically frozen in 1996, and they were able to ride that at the top for years. For instance, PA-RISC doesn't really have appropriate instructions for the atomic operations you'd want today. But it led to working on the right problems: a hardwired-control, RISCy chip that happened to be philosophically similar to the survivor, POWER. |
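|
| (Sketch for context: PA-RISC's one atomic memory primitive is ldcw, load-and-clear-word, so richer atomics have to be emulated with a spinlock built on top of it. A minimal C sketch of that emulation, loosely in the spirit of the Linux parisc port; the names are illustrative, and the real code hashes each target address into an array of such locks:)
|
|     /* ldcw atomically loads a word and clears it to zero. A lock word
|        is "free" when nonzero, so acquire spins until ldcw returns
|        nonzero. On real hardware the lock word must be 16-byte aligned. */
|     typedef struct {
|         volatile unsigned int lock __attribute__((aligned(16)));
|     } pa_lock_t;
|
|     static unsigned int ldcw(volatile unsigned int *w) {
|         unsigned int old;
|         __asm__ __volatile__("ldcw 0(%1),%0" : "=r"(old) : "r"(w) : "memory");
|         return old;
|     }
|
|     static void pa_lock(pa_lock_t *l)   { while (ldcw(&l->lock) == 0) ; }
|     static void pa_unlock(pa_lock_t *l) { l->lock = 1; }
|
|     /* Any read-modify-write "atomic" becomes a tiny critical section: */
|     static int atomic_fetch_add_emulated(pa_lock_t *l, int *p, int v) {
|         pa_lock(l);
|         int old = *p;
|         *p = old + v;
|         pa_unlock(l);
|         return old;
|     }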
|
|
| ▲ | jimmaswell 4 days ago | parent | prev [-] |
| I wonder if we could make an LLM or other modern machine learning framework finally figure out how to compile to Itanic in an optimized fashion. |
| |
| ▲ | duskwuff 4 days ago | parent | next [-] | | No. The problems involved are fundamental:
|
| 1) Load/store latency is unpredictable: whenever you get a cache miss (which is itself unpredictable*), you have to wait for the value to come back from main memory, a wait that keeps getting longer as CPUs get faster while memory latency roughly stays the same. Statically scheduling around this sort of unpredictable latency is extremely difficult; you're better off doing it on the fly.
|
| 2) Modern algorithms for branch prediction and speculative execution are dynamic. They can make observations like "this branch has been taken 15/16 of the last times we've hit it, so we'll predict it's taken the next time", which are potentially workload-dependent. Compile-time optimization can't do that.
|
| *: if you could reliably predict when the cache would miss, you'd use that to build a better cache replacement algorithm | | |
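|
| (To make the "15/16 of the last times" idea concrete, here is a minimal C sketch of the classic per-branch 2-bit saturating counter a dynamic predictor keeps; the gshare scheme in the reply below indexes a table of exactly these:)
|
|     /* 2-bit saturating counter: states 0-1 predict not-taken, 2-3
|        predict taken. It moves one step per outcome, so a single fluke
|        in a mostly-taken branch doesn't flip the prediction. */
|     typedef unsigned char ctr2_t;   /* holds 0..3 */
|
|     int predict_taken(ctr2_t c) { return c >= 2; }
|
|     void train(ctr2_t *c, int taken) {
|         if (taken)  { if (*c < 3) (*c)++; }
|         else        { if (*c > 0) (*c)--; }
|     }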
| ▲ | brucehoult 4 days ago | parent [-] | | > this branch has been taken 15/16 of the last times we've hit it
|
| That is kind of how it worked more than 30 years ago (pre 1995), but not since, at least in OoO CPUs. In fact it was found that having more than a 2-bit saturating counter doesn't help, because when the situation changes it takes too many bad predictions in a row to get to predicting that, actually, this branch is not being taken any more.
|
| What both the Pentium Pro and PowerPC 604 (the first OoO designs in each family) had was a global history of how you GOT TO the current branch. The Pentium Pro had 4 bits of taken/not-taken history for the last four conditional branches, and this was used to decide which 2-bit counter to use for a given branch instruction. The PowerPC 604 used 6 bits of history. The Pentium Pro algorithm for combining the branch address with the history (XOR them!) is called "gshare". The PPC604 did something a little bit different but I'm not sure what. By the PPC750 Motorola was using basically the same gshare algorithm as Intel.
|
| There are newer and better algorithms today -- exactly what is somewhat secret in leading edge CPUs -- but gshare is simple and is common in low-end in-order and small OoO CPUs to this day. The Berkeley BOOM core uses a 13-bit branch history. I think early SiFive in-order cores such as the E31 and U54 used 10 bits. | | |
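|
| (A minimal C sketch of the gshare scheme described above: XOR the branch address with a global outcome history to pick a 2-bit counter. The 12-bit history and table size here are illustrative, not any particular CPU's:)
|
|     #define GHIST_BITS 12
|     #define GTABLE     (1u << GHIST_BITS)
|
|     static unsigned char ctr[GTABLE];   /* 2-bit counters, 0..3 */
|     static unsigned int  ghist;         /* last GHIST_BITS outcomes */
|
|     static unsigned int gidx(unsigned int pc) {
|         return (pc ^ ghist) & (GTABLE - 1);
|     }
|
|     int gshare_predict(unsigned int pc) {
|         return ctr[gidx(pc)] >= 2;      /* predict taken on 2 or 3 */
|     }
|
|     void gshare_train(unsigned int pc, int taken) {
|         unsigned int i = gidx(pc);
|         if (taken)  { if (ctr[i] < 3) ctr[i]++; }
|         else        { if (ctr[i] > 0) ctr[i]--; }
|         ghist = ((ghist << 1) | (taken ? 1u : 0u)) & (GTABLE - 1);
|     }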
| ▲ | duskwuff 4 days ago | parent [-] | | Fair point, I oversimplified a bit. Either way, what matters is that it's dynamic. |
|
| |
| ▲ | o11c 4 days ago | parent | prev [-] | | No, VLIW is a fundamentally flawed idea; OoO is mandatory. "We need better compilers" is purely Intel marketing apologia. | |
| ▲ | whaleofatw2022 4 days ago | parent [-] | | Isn't VLIW how a number of GPUs worked internally? That said, a GPU isn't the same as a general-purpose CPU. | |
| ▲ | buildbot 4 days ago | parent | next [-] | | Yes, as others noted, AMD used VLIW for TeraScale in the Radeon HD 2000-6000 series. https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) VLIW is used in a lot of DSP chips too, where you (hopefully) have very simple branching, if any, and nice data access patterns. | |
| ▲ | Sesse__ 4 days ago | parent [-] | | And typically just fast RAM everywhere instead of caches, so you don't have cache misses. (The flip side is that you typically have very _little_ RAM.) |
| |
| ▲ | classichasclass 4 days ago | parent | prev | next [-] | | Some older ones, yeah (TeraScale comes to mind), but modern ones are more like RISC with whopping levels of SIMD. It turns out that VLIW was hard for them too. | |
|
|
|