> Any modern out-of-order core will (on the happy path) schedule the operations identically whether you did one copy per loop or four.
I can't agree, because an unrolled loop executes fewer counter-increment instructions for the same number of copies.
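
To make that concrete, here is a rough sketch in C. The function names `copy_rolled` and `copy_unrolled4` are made up for illustration, and an optimizing compiler may well unroll or vectorize the first loop on its own, so treat this as a picture of the source-level difference rather than a claim about the generated code.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* One copy per iteration: the counter is incremented (and the loop
   condition tested) once for every element copied. */
void copy_rolled(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Four copies per iteration: the counter is incremented (and the loop
   condition tested) once for every four elements copied. */
void copy_unrolled4(int *dst, const int *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

As written, the rolled version performs n increments for n copies, while the unrolled version performs about n/4 in the main loop plus at most three in the remainder loop.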