These days, I strongly believe that loop unrolling is a pessimization, especially with SIMD code.

Scalar code should be unrolled by the compiler to the SIMD word width to expose potential parallelism. But other than that, correctly predicted branches are free, and so is loop instruction overhead on modern wide-dispatch processors. For example, even running a maximally efficient AVX512 kernel on a zen5 machine that dispatches 4 EUs and some load/stores and calculates 2048 bits in the vector units every cycle, you still have a ton of dispatch capacity to handle the loop overhead in the scalar units.

The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache. I wish Clang in particular would stop unrolling the dang vector loops.

▲

bobmcnamara 6 months ago | parent | next [-]

> The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache.

There are some cases where useful code density goes up.

Ex: unroll the Goertzel algorithm by a even number, and suddenly the entire delay line overhead evaporates.

▲

adgjlsfhk1 6 months ago | parent | prev | next [-]

The part that's really weird is that on modern CPUs predicted branches are free iff they're sufficiently rare (<1 out of 8 instructions or so). but if you have too many, you will be bottlenecked on the branch since you aren't allowed to speculate past a 2nd (3rd on zen5 without hyperthreading?) branch.

▲

dzaima 6 months ago | parent [-]

The limiting thing isn't necessarily speculating, but more just the number of branches per cycle, i.e. number of non-contiguous locations the processor has to query from L1 / uop cache (and which the branch predictor has to determine the location of). You get that limit with unconditional branches too.

▲

gpderetta 6 months ago | parent [-]

Indeed, the limit is on taken branches, hence why making the most likely case fall through is often an optimization.

	▲	adgjlsfhk1 6 months ago \| parent [-]
		The tricky part here is that compilers are pretty bad (without PGO at least) of knowing what side of the branch matters.

▲

dzaima 6 months ago | parent | prev | next [-]

Intel still shares ports between vector and scalar on P-cores; a scalar multiply in the loop will definitely fight with a vector port, and the bits of pointer bumps and branch and whatnot can fill up the 1 or 2 scalar-only ports. And maybe there are some minor power savings from wasting resources on the scalar overhead. Still, clang does unroll way too much.

▲

Remnant44 6 months ago | parent [-]

My understanding is that they've changed this for Lion Cove and all future P cores, moving to much more of a Zen-like setup with seperate schedulers and ports for vector and scalar ops.

	▲	dzaima 6 months ago \| parent [-]
		Oh, true, mistook it for an E-core while clicking through diagrams due to the port spam.. Still, that's a 2024 microarchirecture, it'll be like a decade before it's reasonable to ignore older ones.

▲

Const-me 6 months ago | parent | prev [-]

> schedule the operations identically whether you did one copy per loop or four

They don’t always do that well when you need a reduction in that loop, e.g. you are searching for something in memory, or computing dot product of long vectors.

Reductions in the loop form a continuous data dependency chain between loop iteration, which prevents processors from being able to submit instructions for multiple iterations of the loop. Fixable with careful manual unrolling.