But not because of its ISA. I mean, to first approximation everything is a "flop" in semiconductor architectures (or really in tech in general). The population of genuinely successful products is a tiny fraction of the stuff people tried to sell.

In this particular case: ia64 leaned hard into wide VLIW in an era where growing transistor budgets made it possible to decode and issue traditional instructions in parallel[1]. The Itaniums really were fine CPUs, they just weren't particularly advantageous relative to the P6 cores against which they were competing, so no one bought them.

[1] In some sense, VLIW won as a matter of pipeline architecture, it only lost as a design point in ISA specs. Your MacBook is issuing 10 arm64 instructions every cycle, and it doesn't need to futz with the instruction format to do it.

▲ wbl 5 days ago | parent [-]

VLIW came with an implication that static scheduling would win out. The deeply OoO chips you see now have a very different architecture to support that: Itanium was much more a DSP-like thing.

▲ ajross 5 days ago | parent | next [-]

Even in VLIW, DRAM fetches are slow, instructions have variable latency and write-before-retire register collisions require renaming. Itanium would have gotten there at some point. OoO isn't an optional feature for high performance systems and that was clear even in the 90's.
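
To make the renaming point concrete, here is a minimal sketch of register renaming in Python. It is illustrative only: the `rename` function, the `(dest, srcs)` tuple format, and the `p0`, `p1`, ... physical register names are assumptions made for this example, not anything from a real core.

```python
def rename(instrs, arch_regs):
    """Give every write a fresh physical register (hypothetical model).

    instrs: list of (dest, srcs) tuples using architectural register names.
    Returns the same instructions rewritten with physical register names.
    """
    # Each architectural register starts out mapped to its own physical register.
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, srcs in instrs:
        # Sources read whatever physical register currently holds the value.
        new_srcs = tuple(mapping[s] for s in srcs)
        # The destination gets a brand new physical register, so a later
        # write to the same architectural name no longer collides.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [("r1", ("r2",)), ("r3", ("r1",)), ("r1", ("r4",))]
renamed_program = rename(program, ["r1", "r2", "r3", "r4"])
```

In this toy program the third instruction reuses `r1` while the second still needs the old value; after renaming the two writes land in different physical registers, so only the true read-after-write dependency remains.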

▲ wbl 5 days ago | parent [-]

If you have that what's the VLIW getting you?

▲ ajross 5 days ago | parent | next [-]

Fewer transistors and pipeline stages required for the decode unit, which is a real but moderate advantage. And it turned out the window was very narrow and the relative win got smaller and smaller over time. And other externalities where VLIW loses moderately, like total instruction size (i.e. icache footprint) turned out to be more important.

▲ cesarb 5 days ago | parent [-]

> Fewer transistors and pipeline stages required for the decode unit, which is a real but moderate advantage.

Isn't having fixed-size naturally-aligned instructions (like on 64-bit ARM) enough to get that advantage?

▲ ajross 5 days ago | parent [-]

ARM is easier than x86, but not really. VLIW instructions also encode the superscalar pipeline assignments (or a reasonable proxy for them) and are required to be constructed without instruction interdependencies (within the single bundle, anyway), which traditional ISAs need to spend hardware to figure out.
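
As a rough illustration of the dependency check that a bundle encodes ahead of time and that a traditional decoder has to do in hardware, here is a small Python sketch. The `Instr` type and `bundle_is_independent` function are hypothetical names for this example, and the check is deliberately conservative (it rejects any read-after-write, write-after-write, or write-after-read hazard); real ISAs have more nuanced rules.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def bundle_is_independent(bundle):
    """Return True if no instruction in the group depends on an earlier one."""
    written = set()
    read = set()
    for ins in bundle:
        # Read-after-write: this slot reads a register an earlier slot writes.
        if any(src in written for src in ins.srcs):
            return False
        if ins.dest is not None:
            # Write-after-write or write-after-read against an earlier slot.
            if ins.dest in written or ins.dest in read:
                return False
            written.add(ins.dest)
        read.update(ins.srcs)
    return True


ok = bundle_is_independent([Instr("r1", ("r2", "r3")), Instr("r4", ("r5",))])
bad = bundle_is_independent([Instr("r1", ("r2",)), Instr("r3", ("r1",))])
```

With VLIW the compiler runs something like this once at build time; a superscalar decoder has to reach the same answer for every group of instructions it fetches, every cycle.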

Really VLIW is a fine idea. It's just not that great an idea, and in practice it wasn't enough to save ia64. But it's not what killed it, either.

▲ codedokode 5 days ago | parent [-]

The problem with ia64 was that if you had 1000 legacy applications for x86, written by third-party contractors, for many of which you don't even have the source, then ia64 must be 100x better than standard CPUs to justify rewriting the apps. And by the way that's why open source makes such migrations much cheaper.

▲ codedokode 5 days ago | parent | prev [-]

Out-of-order architectures are inhumanly complex, especially figuring out the dependencies. For example, can we reorder these two instructions, or must we execute them sequentially?

    ld r1, [r2 + 10]
    st [r3 + 4], r4

And then consider things like speculative execution.
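
The reordering question for the `ld`/`st` pair above comes down to whether the two addresses overlap, which depends on the runtime values of `r2` and `r3`. A minimal Python sketch of that check follows; the 4-byte access size and the function names are assumptions for illustration, since the snippet doesn't specify them.

```python
def ranges_overlap(addr_a, size_a, addr_b, size_b):
    """True if the byte ranges [addr, addr + size) intersect."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


def can_reorder(r2, r3, access_size=4):
    """Can `ld r1, [r2 + 10]` and `st [r3 + 4], r4` safely swap order?

    access_size is an assumed width; the original snippet doesn't give one.
    """
    load_addr = r2 + 10
    store_addr = r3 + 4
    # Swapping is only safe when the store cannot touch the loaded bytes.
    return not ranges_overlap(load_addr, access_size, store_addr, access_size)


far_apart = can_reorder(r2=0, r3=100)   # disjoint addresses
same_word = can_reorder(r2=0, r3=6)     # both access address 10
```

The compiler usually can't prove this statically, so the hardware either waits until both addresses are known or guesses and recovers if the guess was wrong.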

▲ 1718627440 3 days ago | parent | next [-]

Honestly, to me it seems like optimizing compilers and out-of-order CPUs are actually doing the same thing. Can't we get rid of one or the other?

Either have a stupid ISA and do all the work ahead-of-time with way more compute time to optimize, or don't optimize and have a higher level ISA that also has concepts like pointer provenance.

The current state seems like a local minimum, with both having ahead-of-time optimization, but the ISA does its thing anyways and also the compiler throwing much of the information away with OoO analysis being time-critical.

▲ wbl 3 days ago | parent [-]

The compiler doesn't know the dynamic state of the CPU memory hierarchy and you don't want it to. Even the CPU doesn't know until it finds out how long a load will take. Meanwhile the CPU probably can't do a loop invariant hoist in a reasonable way or understand high level semantics.

▲ wbl 4 days ago | parent | prev [-]

But you already pay that price anyway.

▲ tadfisher 5 days ago | parent | prev [-]

If only that could have worked, then we could have avoided the whole Spectre/Meltdown mess and resulting mitigations.