convolvatron 3 months ago

i would certainly add lack of reductions ('horizontal' operations) and a more generalized model of communication to the list.

dzaima 3 months ago | parent | next [-]

ARM NEON does have sum, min, and max reductions (and/or reductions can just be min/max if all bits in an element are the same), along with pairwise ops. RVV has sum, min, max, and, or, xor reductions. x86 has psadbw, which sums windows of eight 8-bit ints, plus various instructions for pairwise horizontal operations.
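To make the psadbw behavior concrete, here's a scalar model of what a single 8-byte window computes (a sketch of the semantics, not the intrinsic itself): the sum of absolute differences of eight unsigned bytes, which always fits in 16 bits since the maximum is 8 * 255 = 2040. The function name is made up for illustration.

```c
#include <stdint.h>

/* Scalar model of one psadbw window: sum of absolute differences
   of eight unsigned bytes. Hypothetical name; the real instruction
   does this per 8-byte group inside a SIMD register. */
static uint16_t psadbw8(const uint8_t a[8], const uint8_t b[8]) {
    uint16_t s = 0;
    for (int i = 0; i < 8; i++)
        s += (uint16_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return s;
}
```

A common trick: passing an all-zero second operand turns this into a horizontal sum of eight bytes, which is why psadbw gets used as a cheap byte-sum reduction.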

But in general, building code around reductions isn't something you'd ideally do; they're necessarily going to have higher latency, lower throughput, and take more silicon than lane-wise operations, so it's best to leave reducing to a single element to the loop tail.
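The loop-tail pattern looks like this in plain C (a sketch with a made-up function name, using an array of accumulators to stand in for vector lanes): the hot loop only does independent lane-wise adds, and the single reduction happens once, after the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* stands in for the SIMD width */

static int64_t lanewise_sum(const int32_t *x, size_t n) {
    int64_t acc[LANES] = {0};
    size_t i = 0;

    /* Vector body: LANES independent accumulations, no reduction. */
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += x[i + l];

    /* Loop tail: the one and only horizontal reduction. */
    int64_t total = 0;
    for (int l = 0; l < LANES; l++)
        total += acc[l];

    /* Scalar remainder for n not divisible by LANES. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```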

adgjlsfhk1 3 months ago | parent | prev [-]

The tricky part with reductions is that they are somewhat inherently slow: they generally have to be done pairwise, and a pairwise reduction over 16 elements has only log2(16) = 4 levels, with the available parallelism halving at each one.
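A level-by-level version makes the shrinking parallelism visible (a plain-C sketch; names are invented): over four levels, the tree performs 8, 4, 2, and then 1 adds, so the later levels leave most of the hardware idle.

```c
#include <stdint.h>

/* Pairwise (tree) reduction of 16 elements, recording how many adds
   can run in parallel at each of the log2(16) = 4 levels. */
static int32_t pairwise_sum16(const int32_t in[16], int ops_per_level[4]) {
    int32_t buf[16];
    for (int i = 0; i < 16; i++) buf[i] = in[i];

    int width = 16;
    for (int level = 0; level < 4; level++) {
        width /= 2;
        ops_per_level[level] = width;  /* adds available at this level */
        for (int i = 0; i < width; i++)
            buf[i] = buf[2 * i] + buf[2 * i + 1];
    }
    return buf[0];
}
```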

convolvatron 3 months ago | parent [-]

kinda? this is sort of a direct result of the 'vectors are just sliced registers' model. if I do a pairwise operation and halve my domain at each step, is the resulting vector sparse or dense? if it's dense, then I only really top out in the last log2(slice) steps.

sweetjuly 3 months ago | parent [-]

Yes, but this is not cheap for hardware. CPU designers love SIMD because it lets them just slap down ALUs in parallel and get 32x performance boosts. Reductions, however, are not entirely parallel and instead have a relatively huge gate depth.

For example, suppose an add has a latency of one unit in some silicon process. To add-reduce a 32-element vector, you need a five-deep tree (log2(32) = 5), so the operation has a latency of five units. You can pipeline it, but you can't get around the fact that the reduction has 5x the latency of the non-reduce operations.

adgjlsfhk1 3 months ago | parent [-]

for some, you can cheat a little (e.g. partial-sum reduction), but then you don't get to share IP with the rest of your ALUs. I do really want to see what an optimal 32-wide reduction circuit looks like. For integers, you pretty clearly can do much better than pairwise reduction; floats sound tricky.
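One way integers can beat a pairwise adder tree (an assumption about what "much better" means here, but it's the standard multiplier trick): carry-save addition. A 3:2 compressor level turns three partial sums into two with constant gate depth and no carry propagation; only the final two values need a real carry-propagate add. A plain-C sketch of the idea:

```c
#include <stdint.h>

/* One 3:2 compressor: a + b + c == sum + carry (mod 2^32), computed
   with constant gate depth (a full adder per bit, no carry ripple). */
static void csa(uint32_t a, uint32_t b, uint32_t c,
                uint32_t *sum, uint32_t *carry) {
    *sum   = a ^ b ^ c;
    *carry = ((a & b) | (a & c) | (b & c)) << 1;
}

/* Reduce n values to one: apply 3:2 compressor levels until only two
   values remain, then do a single ordinary add. Each level is O(1)
   depth, so total depth grows like log_{3/2}(n) compressor levels
   plus one carry-propagate add. */
static uint32_t csa_reduce(uint32_t *v, int n) {
    while (n > 2) {
        int w = 0, i = 0;
        for (; i + 2 < n; i += 3) {
            csa(v[i], v[i + 1], v[i + 2], &v[w], &v[w + 1]);
            w += 2;
        }
        for (; i < n; i++)       /* carry leftovers to the next level */
            v[w++] = v[i];
        n = w;
    }
    return n == 1 ? v[0] : v[0] + v[1];
}
```

This only works because integer addition is associative and the carry-save form preserves the sum exactly (mod 2^32); floats get no such representation, which is part of why they sound tricky.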