kazinator 3 days ago
> In other words [because the access sequence is just 10 instructions], memory will be the bottleneck, not the instructions to calculate where an index is.

Ha, that is wishful thinking. If you do this in a tight loop in which everything is in the L1 cache, the instructions hurt! The "memory bandwidth is the bottleneck" reasoning applies when you access bulk data without localized repetition.
HelloNurse 3 days ago
Those 10 instructions are the cost of one access, not of each iteration of a tight loop. A tight loop could be done with a much more complex macro that iterates separately within each segment, amortizing the index-calculation overhead.
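To make the two positions concrete, here is a rough sketch in C of what "iterate separately in each segment" could look like. Everything in it is an assumption for illustration, not the article's actual code: the `SegArray` layout (segment `s` holds `1 << s` ints), the `SEG_COUNT` cap, and the helper names are all hypothetical, and the segment lookup uses a portable loop where a real implementation would probably use a count-leading-zeros instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { SEG_COUNT = 16 };  /* hypothetical cap: at most 2^16 - 1 elements */

typedef struct {
    int *seg[SEG_COUNT];  /* assumed layout: segment s holds (1 << s) ints */
    size_t len;
} SegArray;

/* floor(log2(n)) for n >= 1; portable stand-in for a clz instruction. */
static unsigned seg_of(size_t n) {
    unsigned s = 0;
    while ((n >> (s + 1)) != 0) s++;
    return s;
}

/* Per-access path: map a flat index to (segment, offset) every time. */
static int *seg_at(SegArray *a, size_t i) {
    size_t n = i + 1;
    unsigned s = seg_of(n);
    return &a->seg[s][n - ((size_t)1 << s)];
}

/* Append one value, allocating the segment on first use. Returns 0 on failure. */
static int seg_push(SegArray *a, int v) {
    size_t n = a->len + 1;
    unsigned s = seg_of(n);
    if (s >= SEG_COUNT) return 0;
    if (a->seg[s] == NULL) {
        a->seg[s] = malloc(((size_t)1 << s) * sizeof(int));
        if (a->seg[s] == NULL) return 0;
    }
    a->seg[s][n - ((size_t)1 << s)] = v;
    a->len++;
    return 1;
}

/* Naive loop: pays the index calculation on every element. */
static long sum_by_index(SegArray *a) {
    long total = 0;
    for (size_t i = 0; i < a->len; i++) total += *seg_at(a, i);
    return total;
}

/* Segment-wise loop: the index math runs once per segment, and the
   inner loop is a plain walk over contiguous memory. */
static long sum_by_segment(const SegArray *a) {
    long total = 0;
    size_t remaining = a->len;
    for (unsigned s = 0; s < SEG_COUNT && remaining > 0; s++) {
        size_t cap = (size_t)1 << s;
        size_t n = remaining < cap ? remaining : cap;
        const int *p = a->seg[s];
        for (size_t j = 0; j < n; j++) total += p[j];
        remaining -= n;
    }
    return total;
}
```

Both functions return the same sum. `sum_by_index` is the case kazinator is worried about, and `sum_by_segment` is the amortized form HelloNurse describes; whether the difference matters in practice would have to be measured on the real data structure.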