kazinator 3 days ago

> In other words [because the access sequence is just 10 instructions], memory will be the bottleneck, not the instructions to calculate where an index is.

Ha, that is wishful thinking. If you do this in a tight loop in which everything is in the L1 cache, the instructions hurt!

"Memory bandwidth is the bottleneck" reasoning applies when you stream through bulk data with little localized repetition, so that most accesses miss the cache.

HelloNurse 3 days ago

Those 10 instructions are for one access, not for a tight loop. A tight loop could be done with a much more complex macro that iterates separately in each segment, amortizing the overhead.
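
To make the amortization point concrete, here is a rough sketch in C. The thread does not say how the segmented array is laid out, so everything here is an assumption for illustration: segments that double in size starting at 8 elements, and the hypothetical names `segarray`, `seg_at`, `sum_per_access` and `sum_per_segment`. It also uses plain functions rather than the macro mentioned above. `seg_at` pays the index calculation on every access; `sum_per_segment` pays it once per segment and then runs an ordinary pointer loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SEG_BASE_LOG2 3   /* assumed: first segment holds 8 elements */
#define SEG_MAX 32        /* assumed upper bound on segment count */

typedef struct {
    int *seg[SEG_MAX];    /* seg[s] holds seg_cap(s) elements */
    size_t len;           /* number of elements stored */
} segarray;

/* Capacity of segment s: 8, 16, 32, ... */
static size_t seg_cap(size_t s) { return (size_t)1 << (SEG_BASE_LOG2 + s); }

/* Map a flat index to (segment, offset). A real implementation would
   likely use a count-leading-zeros instruction instead of this loop. */
static size_t seg_locate(size_t i, size_t *off)
{
    size_t q = i + seg_cap(0);
    size_t s = 0;
    while ((q >> (SEG_BASE_LOG2 + s + 1)) != 0)
        s++;
    *off = q - seg_cap(s);
    return s;
}

/* Append one value, allocating the segment on first use. Returns 0 on success. */
static int seg_push(segarray *a, int v)
{
    size_t off;
    size_t s = seg_locate(a->len, &off);
    if (s >= SEG_MAX)
        return -1;
    if (a->seg[s] == NULL) {
        a->seg[s] = malloc(seg_cap(s) * sizeof(int));
        if (a->seg[s] == NULL)
            return -1;
    }
    a->seg[s][off] = v;
    a->len++;
    return 0;
}

/* Single access: pays the full index calculation every time. */
static int *seg_at(segarray *a, size_t i)
{
    size_t off;
    assert(i < a->len);
    size_t s = seg_locate(i, &off);
    return &a->seg[s][off];
}

/* Naive loop: one index calculation per element. */
static long sum_per_access(segarray *a)
{
    long t = 0;
    for (size_t i = 0; i < a->len; i++)
        t += *seg_at(a, i);
    return t;
}

/* Segment-wise loop: the bookkeeping runs once per segment,
   and the inner loop is a plain contiguous scan. */
static long sum_per_segment(const segarray *a)
{
    long t = 0;
    size_t left = a->len;
    for (size_t s = 0; left > 0; s++) {
        size_t n = seg_cap(s) < left ? seg_cap(s) : left;
        const int *p = a->seg[s];
        for (size_t k = 0; k < n; k++)
            t += p[k];
        left -= n;
    }
    return t;
}
```

With doubling segments, 1000 elements span only 7 segments, so the segment-wise version does the per-segment work 7 times instead of running the index calculation 1000 times. Whether that matters in practice depends on the cache behavior argued about above.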