convolvatron a day ago

I would certainly add lack of reductions ('horizontal' operations) and a more generalized model of communication to the list.

adgjlsfhk1 21 hours ago | parent

The tricky part with reductions is that they are somewhat inherently slow: they usually have to be done pairwise, and a pairwise reduction over 16 elements naturally has pretty limited parallelism.
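A sketch (not from the thread) of why pairwise reduction parallelism is limited: each level of the tree halves the number of partial results, so a 16-element reduce takes log2(16) = 4 dependent levels no matter how many lanes are available.

```python
def pairwise_reduce(xs, op):
    """Reduce xs with op as a balanced pairwise tree.

    Returns the result and the number of dependent levels.
    Assumes len(xs) is a power of two for simplicity.
    """
    levels = 0
    while len(xs) > 1:
        # All pairs within one level are independent (parallel in hardware),
        # but each level must wait for the previous one to finish.
        xs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        levels += 1
    return xs[0], levels

total, depth = pairwise_reduce(list(range(16)), lambda a, b: a + b)
# total == 120, depth == 4
```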

convolvatron 21 hours ago | parent

Kinda? This is sort of a direct result of the 'vectors are just sliced registers' model. If I do a pairwise operation and divide my domain by 2 at each step, is the resulting vector sparse or dense? If it's dense, then I only really top out when I'm in the last log2(slice) steps.

sweetjuly 13 hours ago | parent

Yes, but this is not cheap for hardware. CPU designers love SIMD because it lets them just slap down ALUs in parallel and get 32x performance boosts. Reductions, however, are not entirely parallel and instead have a relatively huge gate depth.

For example, suppose an add operation has a latency of one unit in some silicon process. To add-reduce a 32-element vector, you'll need a five-deep adder tree, which means your operation has a latency of five units. You can pipeline this, but you can't get around the fact that this operation has 5x the latency of the non-reduce operations.
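The latency arithmetic above can be written out directly. This is just a sketch of the reasoning, assuming 2-input adders and a 1-unit add latency as in the comment:

```python
import math

def reduce_latency(lanes, add_latency=1):
    """Latency of an add-reduce built as a tree of 2-input adders.

    A reduction over `lanes` elements needs ceil(log2(lanes)) dependent
    adder levels; pipelining hides throughput cost but not this depth.
    """
    return math.ceil(math.log2(lanes)) * add_latency

# 32-lane add-reduce: 5 units of latency, vs 1 unit for a plain
# elementwise SIMD add with the same adders laid out side by side.
```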

adgjlsfhk1 10 hours ago | parent

For some, you can cheat a little (e.g. partial-sum reduction), but then you don't get to share IP with the rest of your ALUs. I do really want to see what an optimal 32-wide reduction circuit looks like. For integers, you pretty clearly can do much better than pairwise reduction; float sounds tricky.
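One reading of the partial-sum cheat mentioned above, sketched in software terms: keep a vector of per-lane accumulators so the hot loop is all independent (full-throughput) adds, and pay the tree latency only once at the end. The function name and lane count here are illustrative, not from the thread:

```python
def strided_sum(xs, lanes=4):
    """Sum xs using per-lane partial sums plus one final tree reduce.

    Assumes `lanes` is a power of two. The loop body has no
    cross-lane dependency, so it maps onto plain SIMD adds.
    """
    acc = [0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x            # independent per-lane adds
    # One final horizontal reduce, only log2(lanes) levels deep.
    while len(acc) > 1:
        acc = [acc[j] + acc[j + 1] for j in range(0, len(acc), 2)]
    return acc[0]
```

Note that for floats this reorders the additions, so the result can differ slightly from a strict left-to-right sum, which is part of why the float case is tricky.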