wbl 5 days ago

VLIW came with an implication that static scheduling would win out. The deeply OoO chips you see now have a very different architecture to support dynamic scheduling: Itanium was much more of a DSP-like thing.

ajross 5 days ago | parent | next [-]

Even in VLIW, DRAM fetches are slow, instructions have variable latency, and write-before-retire register collisions require renaming. Itanium would have gotten there at some point. OoO isn't an optional feature for high-performance systems, and that was clear even in the '90s.

wbl 5 days ago | parent [-]

If you have that, what's the VLIW getting you?

ajross 5 days ago | parent | next [-]

Fewer transistors and pipeline stages required for the decode unit, which is a real but moderate advantage. And it turned out the window was very narrow and the relative win got smaller and smaller over time. And other externalities where VLIW loses moderately, like total instruction size (i.e. icache footprint), turned out to be more important.

cesarb 5 days ago | parent [-]

> Fewer transistors and pipeline stages required for the decode unit, which is a real but moderate advantage.

Isn't having fixed-size naturally-aligned instructions (like on 64-bit ARM) enough to get that advantage?

ajross 5 days ago | parent [-]

ARM is easier than x86, but not really. VLIW instructions also encode the superscalar pipeline assignments (or a reasonable proxy for them) and are required to be constructed without instruction interdependencies (within the single bundle, anyway), which traditional ISAs need to spend hardware to figure out.

Really VLIW is a fine idea. It's just not that great an idea, and in practice it wasn't enough to save ia64. But it's not what killed it, either.
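
To make the "bundle encodes the pipeline assignment" point concrete, here is a rough C sketch assuming the published IA-64 bundle format (a 5-bit template plus three 41-bit instruction slots in 128 bits); bundle_t and the decode helpers are made up for illustration, not a real disassembler. The template field is what tells the front end which unit each slot targets and where the dependency "stops" fall, which is exactly the work a conventional decoder has to discover in hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Rough sketch of an IA-64-style bundle: 128 bits holding a 5-bit
     * template and three 41-bit instruction slots. The template names the
     * execution-unit type of each slot (and where the stop boundaries fall),
     * so the front end never has to discover intra-bundle dependencies on
     * its own. Illustrative only, not a faithful disassembler. */
    typedef struct {
        uint64_t lo, hi;                        /* raw 128-bit bundle */
    } bundle_t;

    static unsigned bundle_template(bundle_t b) {
        return (unsigned)(b.lo & 0x1f);         /* bits 0..4 */
    }

    static uint64_t bundle_slot(bundle_t b, int i) {
        unsigned start = 5 + 41u * (unsigned)i; /* slots start at bits 5, 46, 87 */
        uint64_t mask = (1ULL << 41) - 1;
        if (start + 41 <= 64)
            return (b.lo >> start) & mask;      /* slot entirely in the low half */
        if (start >= 64)
            return (b.hi >> (start - 64)) & mask;
        /* slot straddles the two 64-bit halves */
        return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
    }

    int main(void) {
        bundle_t b = { .lo = 0x10, .hi = 0 };   /* made-up bits, template 0x10 */
        printf("template=%#x slot1=%#llx\n", bundle_template(b),
               (unsigned long long)bundle_slot(b, 1));
        return 0;
    }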

codedokode 5 days ago | parent [-]

The problem with ia64 was that if you had 1000 legacy applications for x86, written by third-party contractors, for many of which you didn't even have the source, then ia64 had to be 100x better than standard CPUs to justify rewriting the apps.

And by the way that's why open source makes such migrations much cheaper.

codedokode 5 days ago | parent | prev [-]

Out-of-order architectures are inhumanly complex, especially figuring out the dependencies. For example, can we reorder these two instructions, or must we execute them sequentially?

    ld r1, [r2 + 10]
    st [r3 + 4], r4
And then consider things like speculative execution.
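
In C terms that pair is roughly r1 = r2[10]; r3[4] = r4; (treating r2 and r3 as hypothetical int pointers and the offsets as element indices). A minimal sketch of why the order matters only when the addresses alias, which is generally known only once both addresses have been computed:

    #include <stdio.h>

    /* Reordering the load and the store is only legal when r2+10 and
     * r3+4 do not alias. */
    static int load_then_store(int *r2, int *r3, int r4) {
        int r1 = r2[10];     /* ld r1, [r2 + 10] */
        r3[4] = r4;          /* st [r3 + 4], r4  */
        return r1;
    }

    static int store_then_load(int *r2, int *r3, int r4) {
        r3[4] = r4;          /* store hoisted above the load */
        int r1 = r2[10];
        return r1;
    }

    int main(void) {
        int buf[16] = {0};

        /* No aliasing: buf+10 and (buf+1)+4 differ, so both orders agree. */
        int a = load_then_store(buf, buf + 1, 7);
        int b = store_then_load(buf, buf + 1, 7);
        printf("no alias: %d %d\n", a, b);    /* 0 0 */

        /* Aliasing: buf+10 == (buf+6)+4, so the order is observable. */
        int c = load_then_store(buf, buf + 6, 7);
        int d = store_then_load(buf, buf + 6, 7);
        printf("alias:    %d %d\n", c, d);    /* 0 7 */
        return 0;
    }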

1718627440 3 days ago | parent | next [-]

Honestly, to me it seems like optimizing compilers and out-of-order CPUs are doing the same thing. Can't we get rid of one or the other?

Either have a stupid ISA and do all the work ahead of time, with way more compute time to optimize, or don't optimize and have a higher-level ISA that also has concepts like pointer provenance.

The current state seems like a local minimum: both do ahead-of-time optimization, but the compiler throws much of its information away, the ISA does its thing anyway, and the OoO analysis is time-critical.

wbl 3 days ago | parent [-]

The compiler doesn't know the dynamic state of the CPU memory hierarchy and you don't want it to. Even the CPU doesn't know until it finds out how long a load will take.

Meanwhile the CPU probably can't do a loop invariant hoist in a reasonable way or understand high level semantics.
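
A minimal sketch of that kind of hoist in C (the scale_* functions are made up for illustration): the compiler can prove sqrt(k) is the same on every iteration and compute it once, while an OoO core just keeps re-executing the instruction stream it is given.

    #include <math.h>
    #include <stddef.h>

    /* Naive form: sqrt(k) does not depend on i, but it sits inside the
     * loop, so the hardware sees (and executes) it on every iteration. */
    void scale_naive(double *dst, const double *src, size_t n, double k) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * sqrt(k);
    }

    /* After a loop-invariant hoist, the kind an optimizing compiler does:
     * the invariant expression is computed once, outside the loop. */
    void scale_hoisted(double *dst, const double *src, size_t n, double k) {
        double s = sqrt(k);
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * s;
    }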

wbl 4 days ago | parent | prev [-]

But you already pay that price anyway.

tadfisher 5 days ago | parent | prev [-]

If only that could have worked, then we could have avoided the whole Spectre/Meltdown mess and resulting mitigations.