sweetjuly 13 hours ago

Yes, but this is not cheap for hardware. CPU designers love SIMD because it lets them just slap down ALUs in parallel and get 32x performance boosts. Reductions, however, are not entirely parallel and instead have a relatively huge gate depth.

For example, suppose an add operation has a latency of one unit in some silicon process. To add-reduce a 32-element vector, you'll need a five-deep tree of adders (log2(32) = 5), which means the operation has a latency of five units. You can pipeline this, but you can't get around the fact that the operation has 5x the latency of the non-reduce operations.
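
To make the depth concrete, here's a minimal scalar sketch (plain C, names mine, not anyone's actual hardware or intrinsics) of that pairwise tree: each pass halves the operand count, so a 32-element add-reduce takes five dependent levels even though the adds within each level are independent.

    #include <stdint.h>

    /* Pairwise (tree) add-reduce of a 32-element vector. The adds inside
     * each pass are independent, but the five passes are serially
     * dependent -- the "five deep tree" above. */
    int32_t reduce_add_32(const int32_t v[32]) {
        int32_t tmp[32];
        for (int i = 0; i < 32; i++)
            tmp[i] = v[i];

        /* 5 levels: 32 -> 16 -> 8 -> 4 -> 2 -> 1 */
        for (int width = 16; width >= 1; width /= 2)
            for (int i = 0; i < width; i++)
                tmp[i] += tmp[i + width];

        return tmp[0];
    }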

adgjlsfhk1 10 hours ago

For some operations you can cheat a little (e.g. partial-sum reduction), but then you don't get to share IP with the rest of your ALUs. I really want to see what an optimal 32-wide reduction circuit looks like. For integers, you can pretty clearly do much better than pairwise reduction; float sounds tricky.
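
(A sketch of the kind of integer trick this might mean, assuming the carry-save / partial-sum idea: a 3:2 compressor turns three operands into two with constant gate depth and no carry propagation, so a Wallace-style tree can compress 32 integers down to 2 in fewer adder delays than five full carry-propagate adds, leaving one real add at the end. Plain C model, names mine; a real circuit would differ.)

    #include <stdint.h>

    /* 3:2 compressor (carry-save adder): three operands -> (sum, carry),
     * computed bitwise, with no carry chain. */
    static void csa(uint32_t a, uint32_t b, uint32_t c,
                    uint32_t *sum, uint32_t *carry) {
        *sum   = a ^ b ^ c;
        *carry = ((a & b) | (a & c) | (b & c)) << 1;
    }

    /* Compress 32 operands down to 2 with carry-save adders, then do one
     * final carry-propagate add. Each CSA level has constant depth
     * regardless of word width, which is why an integer add-reduce
     * circuit can beat a plain tree of full adders. Result is the sum
     * mod 2^32, matching wrapping integer addition. */
    uint32_t csa_reduce32(const uint32_t v[32]) {
        uint32_t ops[32];
        int n = 32;
        for (int i = 0; i < n; i++)
            ops[i] = v[i];

        while (n > 2) {
            int out = 0, i = 0;
            for (; i + 2 < n; i += 3) {        /* operands three at a time */
                uint32_t s, c;
                csa(ops[i], ops[i + 1], ops[i + 2], &s, &c);
                ops[out++] = s;
                ops[out++] = c;
            }
            for (; i < n; i++)                 /* carry leftovers forward */
                ops[out++] = ops[i];
            n = out;
        }
        return ops[0] + ops[1];                /* one carry-propagate add */
    }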