Remix clone Hacker News


	▲	kazinator 3 days ago
		> In other words [because the access sequence is just 10 instructions], memory will be the bottleneck, not the instructions to calculate where an index is.

		Ha, that is wishful thinking. If you do this in a tight loop in which everything is in the L1 cache, the instructions hurt! "Memory bandwidth is the bottleneck" reasoning applies when you access bulk data without localized repetition.
	▲	HelloNurse 3 days ago
		Those 10 instructions are for one access, not for a tight loop. A tight loop could be done with a much more complex macro that iterates separately in each segment, amortizing the overhead.
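
The two access patterns argued over above can be contrasted in a minimal C sketch. The thread does not give the actual layout or macro, so everything here is an assumption made for illustration: a segmented array in which segment `s` holds `1 << s` elements, and the hypothetical names `SegArray`, `seg_at`, `seg_push`, `sum_per_access`, and `sum_by_segment`. The portable `floor_log2` loop stands in for the count-leading-zeros instruction a real implementation would likely use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SEG_COUNT 16 /* hypothetical cap: room for 2^16 - 1 elements */

/* Assumed layout (not taken from the thread): segment s holds (1 << s) ints. */
typedef struct {
    int *seg[SEG_COUNT];
    size_t len;
} SegArray;

/* floor(log2(x)) for x > 0; a portable stand-in for a clz-based version. */
static unsigned floor_log2(size_t x) {
    unsigned r = 0;
    while (x >>= 1) r++;
    return r;
}

/* Per-access path: every call recomputes the segment and the offset. */
static int *seg_at(SegArray *a, size_t i) {
    assert(i < a->len);
    size_t j = i + 1;
    unsigned s = floor_log2(j);
    return &a->seg[s][j - ((size_t)1 << s)];
}

/* Append one value, allocating the segment on first use. Returns 0 on success. */
static int seg_push(SegArray *a, int v) {
    size_t j = a->len + 1;
    unsigned s = floor_log2(j);
    if (s >= SEG_COUNT) return -1;
    if (a->seg[s] == NULL) {
        a->seg[s] = malloc(((size_t)1 << s) * sizeof(int));
        if (a->seg[s] == NULL) return -1;
    }
    a->seg[s][j - ((size_t)1 << s)] = v;
    a->len++;
    return 0;
}

/* Tight loop built on the per-access path: index math runs once per element. */
static long sum_per_access(SegArray *a) {
    long total = 0;
    for (size_t i = 0; i < a->len; i++) total += *seg_at(a, i);
    return total;
}

/* Segment-wise loop: index math runs once per segment, and the inner loop
   is a plain walk over contiguous memory. */
static long sum_by_segment(const SegArray *a) {
    long total = 0;
    size_t remaining = a->len;
    for (unsigned s = 0; s < SEG_COUNT && remaining > 0; s++) {
        size_t cap = (size_t)1 << s;
        size_t n = remaining < cap ? remaining : cap;
        const int *p = a->seg[s];
        for (size_t k = 0; k < n; k++) total += p[k];
        remaining -= n;
    }
    return total;
}
```

Both functions return the same sum. The difference is that `sum_per_access` pays the segment calculation for every element, which is the cost kazinator points at when the data sits in L1, while `sum_by_segment` pays it once per segment, which is the amortization HelloNurse describes.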