▲ | Const-me 9 hours ago | |
> schedule the operations identically whether you did one copy per loop or four They don’t always do that well when you need a reduction in that loop, e.g. you are searching for something in memory, or computing dot product of long vectors. Reductions in the loop form a continuous data dependency chain between loop iteration, which prevents processors from being able to submit instructions for multiple iterations of the loop. Fixable with careful manual unrolling. |