Variable-length decoding is more or less figured out, but it takes more design effort, transistors, and energy. Those costs are real, but relatively small in a current state-of-the-art, super-wide out-of-order CPU.

wallopinski 3 days ago | parent | next [-]

"Transistors are free."

That was pretty much the uArch/design mantra at Intel.

nerpderp82 3 days ago | parent | next [-]

Isn't that still true for high-perf chips? We don't have ways to use all those transistors, so we make larger and larger caches.

exmadscientist 3 days ago | parent [-]

Max-performance chips even introduce dead dummy transistors ("dark silicon") to provide a bit of heat sinking capability. Having transistors that are sometimes-but-rarely useful is no problem whatsoever for modern processes.

yvdriess 3 days ago | parent [-]

AFAIK the term "dark silicon" refers specifically to transistors that are not always powered on. Doping the Si substrate to turn it into transistors is not going to change the heat profile, so I don't think dummy transistors are added on purpose for heat management. Happy to be proven wrong though.

exmadscientist 2 days ago | parent [-]

My understanding is that pretty much every possible combination of these things is found somewhere in a modern chip. There are dummy transistors, dark transistors, slow transistors... everything. Somewhere.

drob518 3 days ago | parent | prev [-]

It has turned out to be a pretty good rule of thumb over the decades.

rasz 3 days ago | parent | prev [-]

"Not a lot" is not how I would describe it. Take a 64-bit piece of fetched data. On ARM64 you will just push that into two decoder blocks and be done with it. On x86 you've got what, a range of 1 to 15 bytes per instruction? I don't even want to think about the possible permutations; it's on the order of 10^(some two-digit number).
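To make the fixed-width half of that comparison concrete, here is a minimal sketch (illustrative only; `split_fixed_width` is a made-up name, and little-endian instruction words are assumed) of how an 8-byte fetch splits into two 32-bit words with no length decoding at all:

```python
import struct


def split_fixed_width(fetch: bytes) -> tuple[int, int]:
    """Split an 8-byte fetch into two 32-bit little-endian instruction words."""
    if len(fetch) != 8:
        raise ValueError("expected exactly 8 bytes")
    # "<II" = two unsigned 32-bit integers, little-endian byte order.
    return struct.unpack("<II", fetch)


# Hypothetical usage: the byte pattern below is an A64 NOP followed by RET.
lo_word, hi_word = split_fixed_width(bytes.fromhex("1f2003d5c0035fd6"))
```

The boundaries are known before any byte is inspected, which is the property the variable-length case lacks.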

mohinder 3 days ago | parent | next [-]

You don't need all the permutations. If there are 32 bytes in a cache line then each instruction can only start at one of 32 possible positions. Then if you want to decode N instructions per cycle you need N 32-to-1 muxes. You can reduce the number of inputs to the later muxes since instructions can't be zero size.
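As a rough illustration of the argument above (a sketch only, not how any real decoder is built): the hypothetical `lengths` table stands in for per-offset length predecoding, `first` is an assumed entry offset into the line, and the function names are made up for this example.

```python
LINE_BYTES = 32  # cache-line size assumed in the comment above


def mux_inputs(n_decoders: int, line_bytes: int = LINE_BYTES) -> list[int]:
    """Candidate start offsets that each decoder's mux has to cover.

    Decoder k (0-based) can never start before byte k, because each of the
    k instructions ahead of it is at least one byte long.
    """
    return [line_bytes - k for k in range(n_decoders)]


def pick_starts(lengths: list[int], n_decoders: int, first: int = 0) -> list[int]:
    """Pick start offsets for up to n_decoders instructions in one line.

    lengths[i] is the length an instruction would have if it started at
    byte i (assumed to be precomputed for every offset). Each decoder then
    only selects one offset; no enumeration of byte permutations is needed.
    """
    starts = []
    pos = first
    while len(starts) < n_decoders and pos < len(lengths):
        starts.append(pos)
        pos += lengths[pos]  # the next instruction begins where this one ends
    return starts
```

For example, `mux_inputs(4)` gives `[32, 31, 30, 29]`: the work grows with the number of start positions times the number of decoders, not with the number of possible byte sequences.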

monocasa 3 days ago | parent [-]

It was even simpler until very recently: the decode stage would only look at a floating window of at most 16 bytes.

saagarjha 3 days ago | parent | prev | next [-]

Yes, but you're not describing it from the right position. Is instruction decode hard? Yes, if you think about it in isolation (also, fwiw, it's not a permutation problem as you suggest). But the core has a bunch of other stuff it needs to do that is far harder. Even your lowliest Pentium from 2000 can do instruction decode.

ahartmetz 3 days ago | parent | prev [-]

It's a lot for a decoder, but not for a whole core. Citation needed, but I remember that the decoder is about 10% of a Ryzen core's power budget, and of course that is with a few techniques better than complete brute force.