convolvatron 3 months ago
Kinda? This is sort of a direct result of the "vectors are just sliced registers" model. If I do a pairwise operation and divide my domain by 2 at each step, is the resulting vector sparse or dense? If it's dense, then I only really top out in the last log2(slice) steps.
sweetjuly 3 months ago | parent
Yes, but this is not cheap for hardware. CPU designers love SIMD because it lets them just slap down ALUs in parallel and get 32x performance boosts. Reductions, however, are not entirely parallel and instead have a relatively large gate depth. For example, suppose an add operation has a latency of one unit in some silicon process. To add-reduce a 32-element vector, you'll have a five-deep tree of adders, which means the operation has a latency of five units. You can pipeline this, but you can't avoid the fact that the reduction has 5x the latency of the non-reduce operations.
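A minimal sketch of the tree reduction being described (in Python rather than hardware, purely to illustrate the depth argument): halving the active width each step means a 32-element add-reduce takes log2(32) = 5 levels, and each level costs one add-latency.

```python
def tree_reduce(values):
    """Pairwise tree reduction. The number of levels (steps) is the
    gate depth: log2(len(values)) for a power-of-two input width."""
    steps = 0
    while len(values) > 1:
        # One level of the adder tree: combine adjacent pairs.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, depth = tree_reduce(list(range(32)))
print(total, depth)  # 496 5  -- five levels deep, matching the 5-unit latency
```

A lane-wise SIMD add over the same 32 elements would be a single level deep, which is the latency gap the comment is pointing at.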