pbsd 4 days ago

Vector ALU instruction latencies are understandably listed as 2 cycles and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1]:

    The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
    There is no penalty for operations in the slow region that depend on longer latency operations or loads.
    There is no penalty for any operations in the fast region.
    To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
    The latency test could interleave NOPs to prevent the scheduler from filling up.
Basically, short vector code sequences that don't fill up the scheduler avoid the penalty and so see better latency.

[1] https://www.amd.com/content/dam/amd/en/documents/processor-t...
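
To make the quoted rules concrete, here is a toy sketch in Python. It is only my simplified reading of the manual's wording, not a simulator of the real hardware, and the function name and parameters are made up for illustration.

```python
def observed_latency(producer_latency: int,
                     scheduler_full: bool,
                     in_oldest_entries: bool) -> int:
    """Latency (in cycles) a dependent FP op sees from its producer.

    Hypothetical model of the rules quoted above:
    - the slow region is the oldest scheduler entries, and it only
      exists while the scheduler is full;
    - an op in the slow region that depends on a 1-cycle op pays
      one extra cycle;
    - longer-latency producers and the fast region see no penalty.
    """
    in_slow_region = scheduler_full and in_oldest_entries
    penalty = 1 if (in_slow_region and producer_latency == 1) else 0
    return producer_latency + penalty


# Dependent chain of 1-cycle ops that fills the scheduler: looks like 2 cycles.
print(observed_latency(1, scheduler_full=True, in_oldest_entries=True))
# Same chain, but the scheduler never fills (e.g. NOPs interleaved): 1 cycle.
print(observed_latency(1, scheduler_full=False, in_oldest_entries=True))
# A longer-latency producer is unaffected even in the slow region.
print(observed_latency(3, scheduler_full=True, in_oldest_entries=True))
```

Under this reading, a naive latency test built from one long dependent chain would report 2 cycles for an op that can actually complete in 1.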

Dylan16807 4 days ago | parent [-]

So if you fill up the scheduler with a long line of dependent instructions, you experience a significant slowdown? I wonder why they decided to make it do that instead of limiting the scheduler's size or fill level by a bit, and what all the tradeoffs were.