sweetjuly a day ago

Loop unrolling isn't really done because of pipelining but rather to amortize the cost of looping. Any modern out-of-order core will (on the happy path) schedule the operations identically whether you did one copy per loop or four. The only difference is the number of branches.
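A minimal sketch of the amortization point, with hypothetical function names. Both loops do the same additions, and an out-of-order core schedules them the same way; the only difference is how often the loop bookkeeping (increment, compare, branch) executes:

```c
#include <assert.h>

/* Straightforward sum: one increment, compare, and branch per element. */
long sum_simple(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: the loop bookkeeping now runs once per four elements.
   Assumes n is a multiple of 4, for brevity. */
long sum_unrolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}
```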

Remnant44 a day ago | parent | next [-]

These days, I strongly believe that loop unrolling is a pessimization, especially with SIMD code.

Scalar code should be unrolled by the compiler to the SIMD word width to expose potential parallelism. But other than that, correctly predicted branches are free, and so is loop instruction overhead on modern wide-dispatch processors. For example, even running a maximally efficient AVX512 kernel on a Zen 5 machine that dispatches to 4 vector EUs plus load/store units and computes 2048 bits in the vector units every cycle, you still have plenty of dispatch capacity left over to handle the loop overhead in the scalar units.

The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache. I wish Clang in particular would stop unrolling the dang vector loops.

bobmcnamara 19 hours ago | parent | next [-]

> The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache.

There are some cases where useful code density goes up.

Ex: unroll the Goertzel algorithm by an even number, and suddenly the entire delay line overhead evaporates.
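A sketch of what's meant (hypothetical function names). In the plain form, every iteration spends two moves shuffling the delay line; unrolled by 2, the two state variables simply swap roles each half-iteration and the moves vanish:

```c
#include <assert.h>

/* Plain Goertzel filter (coeff = 2*cos(2*pi*k/N)): one state update
   per sample, plus two "delay line" moves (s2 = s1; s1 = s). */
float goertzel_plain(const float *x, int n, float coeff) {
    float s1 = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; i++) {
        float s = x[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s;
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2; /* squared magnitude */
}

/* Unrolled by 2: s1 and s2 alternate roles, so the delay line
   shuffling disappears entirely. Assumes n is even. */
float goertzel_unrolled(const float *x, int n, float coeff) {
    float s1 = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s2 = x[i]     + coeff * s1 - s2;
        s1 = x[i + 1] + coeff * s2 - s1;
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}
```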

Const-me 6 hours ago | parent | prev | next [-]

> schedule the operations identically whether you did one copy per loop or four

They don’t always do that well when you need a reduction in that loop, e.g. you are searching for something in memory, or computing dot product of long vectors.

Reductions in the loop form a continuous data dependency chain between loop iterations, which prevents the processor from keeping instructions from multiple iterations in flight. Fixable with careful manual unrolling.
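A sketch of the fix, with hypothetical function names. The single-accumulator dot product is throughput-limited by the latency of each add, since every iteration depends on the previous one; splitting the reduction across independent accumulators breaks the chain:

```c
#include <assert.h>

/* Single accumulator: every multiply-add depends on the previous one,
   so throughput is bounded by the add/FMA latency, not by how many
   execution units the core has. */
float dot_serial(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Four independent accumulators: four dependency chains proceed in
   parallel, letting the OoO core overlap iterations. Assumes n is a
   multiple of 4; note FP addition is not associative, so the result
   can differ from the serial version in the last bits. */
float dot_unrolled(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```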

adgjlsfhk1 a day ago | parent | prev | next [-]

The part that's really weird is that on modern CPUs, predicted branches are free only if they're sufficiently rare (fewer than about 1 in 8 instructions). But if you have too many, you will be bottlenecked on the branches, since you aren't allowed to speculate past a 2nd (3rd on zen5 without hyperthreading?) branch.

dzaima a day ago | parent [-]

The limiting thing isn't necessarily speculating, but more just the number of branches per cycle, i.e. number of non-contiguous locations the processor has to query from L1 / uop cache (and which the branch predictor has to determine the location of). You get that limit with unconditional branches too.

gpderetta 6 hours ago | parent [-]

Indeed, the limit is on taken branches, hence why making the most likely case fall through is often an optimization.

dzaima a day ago | parent | prev [-]

Intel still shares ports between vector and scalar on P-cores; a scalar multiply in the loop will definitely fight with a vector port, and the bits of pointer bumps and branch and whatnot can fill up the 1 or 2 scalar-only ports. And maybe there are some minor power savings from wasting resources on the scalar overhead. Still, clang does unroll way too much.

Remnant44 20 hours ago | parent [-]

My understanding is that they've changed this for Lion Cove and all future P cores, moving to much more of a Zen-like setup with separate schedulers and ports for vector and scalar ops.

dzaima 19 hours ago | parent [-]

Oh, true, mistook it for an E-core while clicking through diagrams due to the port spam. Still, that's a 2024 microarchitecture; it'll be like a decade before it's reasonable to ignore older ones.

gpderetta 6 hours ago | parent | prev | next [-]

The looping overhead is trivial, especially in SIMD code, where the loop overhead runs on the scalar hardware.

Unrolling is definitely needed for properly scheduling and pipelining SIMD code even on OoO cores. Remember that an OoO core cannot reorder dependent instructions, so the dependencies need to be broken manually, for example by adding additional accumulators, which in turn requires additional unrolling. This is especially important for SIMD code, which is typically non-branchy with long dependency chains.

Remnant44 34 minutes ago | parent [-]

That's a good point about increased dependency chain length in simd due to the branchless programming style. Unrolling to break a loop-carried dependency is one of the strongest reasons to unroll especially simd code.

Unrolling trivial loops to remove loop counter overhead hasn't been productive for quite a while now, but unfortunately it's still the default for many compilers.

codedokode 11 hours ago | parent | prev | next [-]

> Any modern out-of-order core will (on the happy path) schedule the operations identically whether you did one copy per loop or four.

I cannot agree, because an unrolled loop has fewer counter-increment instructions.

imtringued 3 hours ago | parent | prev [-]

Ok, but the compiler can't do that without unrolling.