saagarjha 3 days ago
> Due to improvements in newer hardware, you might need to use more tricks to reach Speed-of-Light on older GPUs, e.g. pipelining shared-memory-to-register data movements.

On the contrary, older GPUs are a lot easier to hit rooflines on. Newer GPUs run so fast that NVIDIA keeps adding new hardware features to remove the resulting bottlenecks. Not to discount the author's work here, but a 5090 has a pretty low FLOPs-to-memory-bandwidth ratio, so it's comparatively easy to get throttled by the tensor cores there; on datacenter hardware the tensor cores are so fast that you hit limits that were glossed over here.

For example, Ampere-era "mma" instructions won't cut it, because each one computes a really small MMA and forces your inputs to live in registers. You'll need TMA to get data into shared memory and wgmma to issue matrix multiplies directly out of it. At those speeds you run into trouble dispatching instructions and computing addresses (and doing bounds checks) fast enough, and you have to offload that work to specialized hardware to keep up with the tensor cores.
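To make the "really small MMA" point concrete, here is a minimal sketch of a single Ampere-style mma.sync tile in inline PTX. The wrapper function name is hypothetical and added for illustration; only the PTX string is the actual instruction. It shows the two constraints above: one instruction covers only a 16x8 output tile (K=16), and every operand must already be packed into per-thread registers before it can issue.

    #include <cstdint>

    // Sketch: one warp cooperatively computes D = A*B + D for a single
    // 16x8 fp32 tile with K=16 and fp16 inputs (sm_80+). A (a0..a3,
    // eight fp16 values packed two per 32-bit register) and B (b0..b1,
    // four fp16 values) must already live in registers; getting them
    // there (ldmatrix, swizzling, etc.) is its own pipeline.
    __device__ void mma_m16n8k16(float &d0, float &d1, float &d2, float &d3,
                                 uint32_t a0, uint32_t a1, uint32_t a2,
                                 uint32_t a3, uint32_t b0, uint32_t b1) {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
            : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
            : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1));
    }

By contrast, Hopper's wgmma.mma_async takes shared memory descriptors for its operands and computes warpgroup-sized tiles, which is why TMA feeding shared memory becomes the natural pipeline there.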