▲ | Remnant44 21 hours ago |
These days, I strongly believe that loop unrolling is a pessimization, especially with SIMD code. Scalar code should be unrolled by the compiler to the SIMD word width to expose potential parallelism. But beyond that, correctly predicted branches are free, and so is loop-instruction overhead on modern wide-dispatch processors. For example, even running a maximally efficient AVX-512 kernel on a Zen 5 machine that dispatches to four vector EUs (plus some loads/stores) and computes 2048 bits per cycle in the vector units, you still have plenty of dispatch capacity left to handle the loop overhead in the scalar units. The cost of unrolling is decreased code density and reduced effectiveness of the instruction/uop cache. I wish Clang in particular would stop unrolling the dang vector loops.
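A minimal sketch of what the comment is asking for (the function name is made up; the pragma is Clang's documented loop-metadata hint): let the compiler auto-vectorize the scalar loop to the SIMD width, but tell it not to additionally unroll the resulting vector loop.

```c
#include <stddef.h>

/* Plain scalar reduction. Clang/GCC will auto-vectorize this to the
 * SIMD word width on their own; the Clang-specific pragma below asks
 * the compiler to skip the extra unrolling of the vectorized loop
 * (vectorization itself is unaffected). */
float sum_squares(const float *a, size_t n) {
    float acc = 0.0f;
#pragma clang loop unroll(disable)
    for (size_t i = 0; i < n; i++)
        acc += a[i] * a[i];
    return acc;
}
```

Compiling with and without the pragma (or with `-fno-unroll-loops`) and diffing the assembly shows the code-size cost the comment describes.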
▲ | bobmcnamara 17 hours ago | parent | next [-]
> The cost of unrolling is decreased code density and reduced effectiveness of the instruction / uOp cache.

There are some cases where useful code density goes up. Ex: unroll the Goertzel algorithm by an even number, and suddenly the entire delay-line overhead evaporates.
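A sketch of the trick being described (not code from the thread): the textbook Goertzel update shifts a two-tap delay line every sample (`s2 = s1; s1 = s;`). Unrolling by two lets the two state variables alternate roles instead, so the shift instructions vanish.

```c
#include <stddef.h>

/* Goertzel filter unrolled by 2, assuming n is even.
 * coeff = 2 * cos(2 * pi * k / N) for the probed bin k. */
double goertzel_power(const double *x, size_t n, double coeff) {
    double a = 0.0, b = 0.0;          /* the two delay-line taps */
    for (size_t i = 0; i + 1 < n; i += 2) {
        b = x[i]     + coeff * a - b; /* b becomes the newest state */
        a = x[i + 1] + coeff * b - a; /* a becomes the newest again */
    }
    /* standard Goertzel power from the final two states */
    return a * a + b * b - coeff * a * b;
}
```

Each iteration does two samples' worth of useful work with zero state-shuffling instructions, which is why code density improves despite the unroll.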
▲ | Const-me 4 hours ago | parent | prev | next [-]
> schedule the operations identically whether you did one copy per loop or four

They don’t always do that well when you need a reduction in that loop, e.g. when you are searching for something in memory, or computing the dot product of long vectors. Reductions form a continuous data dependency chain across loop iterations, which prevents the processor from issuing instructions from multiple iterations in parallel. Fixable with careful manual unrolling.
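A sketch of the manual fix being described, using a hypothetical dot product: with a single accumulator every add waits on the previous one, so the loop runs at the latency of one add per iteration. Four independent accumulators create four parallel dependency chains the out-of-order core can overlap. (Compilers won't do this for floats on their own without `-ffast-math`, since it reorders the additions.)

```c
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
    /* Four independent accumulators = four independent dependency
     * chains, so ~4 iterations' worth of multiply-adds can be in
     * flight at once instead of serializing on one accumulator. */
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)   /* scalar tail for leftover elements */
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note this is unrolling to break a dependency chain, not unrolling to amortize loop overhead, which is the distinction the thread is drawing.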
▲ | adgjlsfhk1 20 hours ago | parent | prev | next [-]
The part that's really weird is that on modern CPUs predicted branches are free only if they're sufficiently rare (fewer than 1 in 8 instructions or so). If you have too many, you will be bottlenecked on branch throughput, since you aren't allowed to speculate past a 2nd (3rd on Zen 5 without hyperthreading?) branch.
| |||||||||||||||||
▲ | dzaima 20 hours ago | parent | prev [-]
Intel still shares ports between vector and scalar instructions on P-cores; a scalar multiply in the loop will definitely contend with vector ops for a port, and the pointer bumps, branch, and whatnot can fill up the one or two scalar-only ports. And not spending resources on the scalar overhead may yield some minor power savings. Still, Clang does unroll way too much.