Unweaving warp specialization on modern tensor core GPUs (rohany.github.io)
29 points by rohany 13 hours ago | 4 comments
liuliu 12 hours ago
My understanding is that you cannot talk about warp specialization without talking about the alternative: multi-stage pipelining. The final example code given is a multi-stage pipeline with double buffers. Here is my understanding of where they differ:

1. A multi-stage pipeline requires careful hand-tuning, even at the PTX level, to make sure your async waits are weaved properly to maximize overlap.

2. Since register files are now huge, a multi-stage pipeline is difficult to write at the intrinsics level in a way that makes efficient use of them.

3. Warp specialization delegates most of this scheduling to be resolved dynamically, so it is better adapted to the hardware (and has more information to make scheduling decisions at runtime). Although this is a bit moot, because we write different code for different hardware anyway.

Anything more I am missing?
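
To make the contrast concrete, here is a minimal sketch of the double-buffered (2-stage) pipeline alternative described above, written with CUDA's cp.async pipeline primitives for sm_80+. The kernel name, TILE size, and the accumulate step standing in for the real tensor-core math are illustrative assumptions, not code from the article.

  // Minimal 2-stage (double-buffered) cp.async pipeline sketch.
  // Assumes sm_80+ and the pipeline primitives from <cuda_pipeline.h>.
  #include <cuda_pipeline.h>

  #define TILE 128   // assumes blockDim.x == TILE

  __global__ void pipelined_kernel(const float* __restrict__ in,
                                   float* __restrict__ out, int num_tiles) {
    __shared__ float buf[2][TILE];   // double buffer in shared memory
    int t = threadIdx.x;
    float acc = 0.f;

    // Prologue: start the asynchronous copy of tile 0 into buffer 0.
    __pipeline_memcpy_async(&buf[0][t], &in[t], sizeof(float));
    __pipeline_commit();

    for (int i = 0; i < num_tiles; ++i) {
      int cur = i & 1, nxt = (i + 1) & 1;
      // Prefetch tile i+1 into the other buffer while tile i is consumed.
      if (i + 1 < num_tiles)
        __pipeline_memcpy_async(&buf[nxt][t], &in[(i + 1) * TILE + t],
                                sizeof(float));
      __pipeline_commit();
      // Wait until only the most recent commit is still in flight,
      // i.e. the copy of tile i has landed in shared memory.
      __pipeline_wait_prior(1);
      __syncthreads();
      acc += buf[cur][t];            // stand-in for the real tensor-core math
      __syncthreads();               // all reads done before buf[cur] is reused
    }
    out[t] = acc;
  }

The hand-tuning point is visible even in this toy: the placement of the commit/wait pair relative to the compute determines how much of the copy latency is actually hidden, and that weaving has to be redone as the number of stages or the hardware changes.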
majke 11 hours ago
I always assumed that when one warp waits for the result of a long-latency instruction, another warp, potentially from another block, can be scheduled in. I guess this post assumes the need to use all of the GPU's resources from within a single block.