adgjlsfhk1 21 hours ago
The tricky part with reductions is that they are somewhat inherently slow: they often need to be done pairwise, and a pairwise reduction over 16 elements naturally has pretty limited parallelism (8, 4, 2, then 1 independent operations across four dependent steps).
convolvatron 21 hours ago
kinda? this is sort of a direct result of the 'vectors are just sliced registers' model. if I do a pairwise operation and divide my domain by 2 at each step, is the resulting vector sparse or dense? if it's dense, then I only really top out when I'm in the last log2(slice) steps.
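
To make the dense case concrete, here is a minimal sketch in plain Python that models lane usage only, not real SIMD code. The function name `pairwise_reduce_dense` and the `width` parameter (standing in for the slice width) are assumptions for illustration, as is the reading of "dense" as "survivors are repacked into contiguous lanes after each step".

```python
def pairwise_reduce_dense(values, width):
    """Sum `values` pairwise, keeping survivors densely packed.

    `width` is a hypothetical vector (slice) width in lanes.
    Returns (total, utilization), where utilization[i] is the fraction
    of lanes doing useful work in step i, assuming each step is issued
    as ceil(pairs / width) full-width vector operations.
    """
    data = list(values)
    if not data:
        raise ValueError("need at least one element")
    utilization = []
    while len(data) > 1:
        half = len(data) // 2
        # Add the upper half onto the lower half; the result is dense.
        merged = [data[i] + data[i + half] for i in range(half)]
        if len(data) % 2:
            merged.append(data[-1])  # odd element is carried forward
        vector_ops = -(-half // width)  # ceil(half / width)
        utilization.append(half / (vector_ops * width))
        data = merged
    return data[0], utilization


total, util = pairwise_reduce_dense(range(64), 8)
print(total, util)
```

With 64 elements and a width of 8, the first three steps keep every lane busy, and only the last log2(8) = 3 steps run partially empty. Under this model, the 16-element case is slow mostly because it starts close to the vector width.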