dragontamer 2 days ago

8 Integer ALUs, 4 Vector FPUs, 8x L1 d-caches but only 4x L2 d-Caches.

And perhaps most importantly: 4x decoders/4x L1 iCache. IIRC, the entire damn chip was decoder-bound.

--------

Note: AMD Zen has 4x Integer pipelines and 4x FPU pipelines __PER CORE__. Modern high-performance systems CANNOT have a single 2x-pipeline FPU shared between two cores (averaging one pipeline per core). Modern Zen is closer to 4x pipelines per core, maybe more depending on how you count load/store units.
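
To put rough numbers on the sharing argument, here is a minimal Python sketch. The helper name `per_core_share` is made up for illustration, and the figures are only the ones quoted in this comment (a 2-pipeline FPU shared by two cores versus 4 FPU pipelines per Zen core), not datasheet or measured values.

```python
def per_core_share(units: int, cores_sharing: int) -> float:
    """Average number of units each core gets when `units` are shared evenly."""
    if cores_sharing <= 0:
        raise ValueError("cores_sharing must be positive")
    return units / cores_sharing


# Figures taken from the comment above, not from a datasheet.
shared_fpu = per_core_share(units=2, cores_sharing=2)  # 2-pipeline FPU, two cores
zen_fpu = per_core_share(units=4, cores_sharing=1)     # 4 FPU pipelines per core

print(f"shared design: {shared_fpu:.1f} FPU pipelines per core")
print(f"Zen-style:     {zen_fpu:.1f} FPU pipelines per core")
print(f"ratio:         {zen_fpu / shared_fpu:.1f}x")
```

This only averages the hardware across cores. It says nothing about how well either design keeps those pipelines fed.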

dannyw 2 days ago | parent | prev [-]

Yup. The limited decoders meant your pipeline just wasn’t flowing every cycle, because many of the stages were sitting idle.

dragontamer 2 days ago | parent [-]

Note that Intel's modern e-Core has 3x decoders per core. When code is straight, they alternate (decoder#1 / decoder#2 / decoder#3). When code is branchy, they split up across different jumps aka if/else statements.

Shrinking the decoder on Bulldozer was clearly the wrong move for FX-series / AMD. Today's chips are going with a wide decoder (e.g. Apple can do 8x decode per clock tick), a deep opcode cache (AMD Zen has a large opcode cache allowing for 6x-wide lookup per clock tick), or Intel's new and interesting multiple-decoder thing.

sidewndr46 2 days ago | parent [-]

How do you know the behavior of the decoding portion of Intel's E-cores? Do you work for them?

AlotOfReading 2 days ago | parent | next [-]

People use clever code to tease out microarchitectural details and scour public information to work these things out. Agner Fog is one example. His microarch analysis documents 3x decoders for the Tremont microarch, the predecessor to Gracemont (what's currently used for E-cores).

https://www.agner.org/optimize/microarchitecture.pdf

zokier 2 days ago | parent | prev | next [-]

The architecture of Intel cores is widely discussed and publicized. Here are some details for the E-cores mentioned: https://chipsandcheese.com/p/skymont-intels-e-cores-reach-fo...

> Leapfrogging fetch and decode clusters have been a distinguishing feature of Intel’s E-Core line ever since Tremont. Skymont doubles down by adding another decode cluster, for a total of three clusters capable of decoding a total of nine instructions per cycle.
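
As a rough illustration of why extra decode clusters help on branchy code, here is a toy Python model. Everything in it is an assumption for the sake of the sketch: the function name `decode_cycles`, the round-robin hand-off of basic blocks to clusters, and the block sizes are all made up. The only number taken from the quote is three instructions per cluster per cycle (nine total across three clusters). Real hardware is considerably more subtle.

```python
import math


def decode_cycles(block_lengths, clusters, width=3):
    """Cycles to decode a sequence of basic blocks under a toy model.

    Assumptions (hypothetical, for illustration only):
    - blocks are handed to clusters round-robin,
    - each cluster decodes up to `width` instructions per cycle,
    - a cluster finishes one block before starting the next, so a block
      of n instructions costs ceil(n / width) cycles on its cluster,
    - clusters run in parallel, so the total is the busiest cluster's time.
    """
    if clusters < 1:
        raise ValueError("need at least one decode cluster")
    busy = [0] * clusters
    for i, n in enumerate(block_lengths):
        busy[i % clusters] += math.ceil(n / width)
    return max(busy)


blocks = [6, 3, 9, 3, 6, 9]  # made-up basic-block sizes, in instructions
print("one 3-wide cluster:   ", decode_cycles(blocks, clusters=1), "cycles")
print("three 3-wide clusters:", decode_cycles(blocks, clusters=3), "cycles")
```

With these made-up block sizes the three-cluster case does not reach a 3x speedup, because one cluster ends up with the longest blocks. That matches the intuition that the nine-per-cycle figure is a peak rather than a guarantee.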

dragontamer 2 days ago | parent | prev [-]

Intel tells you this in their optimization manuals and white papers.

They want you to write code that takes advantage of their speedups. Agner Fog is a better writer (a sibling comment already linked to Agner Fog's stuff). But I also like referencing the official manuals and whitepapers as primary sources.

Hard to beat Intel's documents on Intel chips after all.