| ▲ | Joker_vD 4 days ago | |
> You wouldn't want an instruction with up to 13 destinations in high performance designs anyways. Why not? Code density matters even in high-performance designs, although I guess the "millicode routines" can help with that somewhat. Still, the ordering of the stores/loads is undefined, and they are allowed to be redone any number of times, so... it shouldn't be onerous to implement? Expanding it into μops during the decoding stages seems straightforward. | ||
| ▲ | camel-cdr 4 days ago | |
> Expanding it into μops during the decoding stages seems straightforward. I wouldn't say so: if you want to be able to crack an instruction into up to N μops, the second instruction's μops could start in any slot from the 2nd to the (1+N)th, and you have to build huge shuffle hardware to support this. Apple, for example, can only crack instructions that generate up to 3 μops at decode (or before rename); anything beyond that needs to be microcoded and stalls the decoding of other instructions. | ||
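
To make the slot argument concrete, here is a rough sketch in Python. It is a toy model, not a description of any real decoder: the function names `first_uop_slots` and `possible_slots_for_second` are made up for illustration, and the only assumption is that μops are packed into consecutive slots in program order.

```python
from itertools import accumulate

def first_uop_slots(uop_counts):
    """Return the 1-based slot where each instruction's first uop lands.

    uop_counts[i] is how many uops instruction i cracks into. Uops are
    assumed to be packed back to back, so each start slot is an
    exclusive prefix sum of the counts before it (hypothetical model).
    """
    offsets = list(accumulate(uop_counts, initial=0))[:-1]
    return [1 + off for off in offsets]

def possible_slots_for_second(max_uops):
    """Slots the 2nd instruction can start in if the 1st cracks into 1..max_uops uops."""
    return sorted({first_uop_slots([n, 1])[1] for n in range(1, max_uops + 1)})

if __name__ == "__main__":
    # A 3-uop instruction followed by a 1-uop and a 2-uop instruction.
    print(first_uop_slots([3, 1, 2]))
    # Cracking limit of 3 versus a hypothetical limit of 13.
    print(possible_slots_for_second(3))
    print(possible_slots_for_second(13))
```

With a limit of 3 the second instruction has three possible start slots; with a limit of 13 it has thirteen, and every later instruction in the decode group depends on the prefix sum of everything before it. That growing set of source-to-slot routes is what the shuffle network has to cover.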