convolvatron 3 months ago
Kinda? This is sort of a direct result of the "vectors are just sliced registers" model. If I do a pairwise operation and divide my domain by 2 at each step, is the resulting vector sparse or dense? If it's dense, then I only really top out in the last log2(slice) steps.
sweetjuly 3 months ago | parent
Yes, but this is not cheap for hardware. CPU designers love SIMD because it lets them just slap down ALUs in parallel and get 32x performance boosts. Reductions, however, are not entirely parallel and instead have a relatively large gate depth. For example, suppose an add operation has a latency of one unit in some silicon process. To add-reduce a 32-element vector, you'll have a five-deep tree of adders, which means the operation has a latency of five units. You can pipeline this, but you can't avoid the fact that the reduction has 5x the latency of the non-reduce operations.
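A minimal sketch of the tree reduction being described (in Python rather than hardware, purely to illustrate the depth argument): halving the active width each step means a 32-element add-reduce takes log2(32) = 5 levels, and each level costs one add-latency.

```python
def tree_reduce(values):
    """Pairwise tree reduction. The number of levels (steps) is the
    gate depth: log2(len(values)) for a power-of-two input width."""
    steps = 0
    while len(values) > 1:
        # One level of the adder tree: combine adjacent pairs.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, depth = tree_reduce(list(range(32)))
print(total, depth)  # 496 5  -- five levels deep, matching the 5-unit latency
```

A lane-wise SIMD add over the same 32 elements would be a single level deep, which is the latency gap the comment is pointing at.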