atq2119 4 days ago

I'd say the main difference is that in traditional GPU languages, the thread of execution is a single lane of a warp or wave. You typically work with ~fp32-sized values, and those are mapped by the compiler to one lane of a 32-wide vector register in a wave (or 16- to 128-wide, depending on the architecture). Control flow often has to be implemented through implicit masking, since different threads mapped to lanes of the same vector can make different control-flow decisions (that is, an if statement in the source program gets compiled to an instruction sequence that uses masking in some way - the details vary by vendor).
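A minimal sketch of that implicit masking, simulated in NumPy (the array stands in for a 32-wide warp; real hardware uses predicate/execution masks, and the variable names here are purely illustrative):

```python
import numpy as np

LANES = 32
x = np.arange(LANES, dtype=np.float32)  # one fp32 value per lane of the warp

# Source-level code would be a per-lane branch:
#   if x < 16: y = x * 2  else: y = x + 100
# Compiled SIMT code effectively runs BOTH sides and uses a mask
# to select which lanes commit which result:
mask = x < 16                  # per-lane predicate (the "execution mask")
then_result = x * 2.0          # executed for all lanes
else_result = x + 100.0        # also executed for all lanes
y = np.where(mask, then_result, else_result)  # mask picks per lane
```

The point is that both branches cost instructions whenever any lane takes them, which is why divergent control flow is expensive in this model.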

In tile languages, the thread of execution is an entire workgroup (or block in CUDA-speak). You typically work with large vector/matrix-sized values. The compiler decides how to distribute those values onto vector registers across the waves of the workgroup. (Example: if your program has a value that is a 32x32 matrix of fp32 elements and a workgroup has 8 32-wide waves, the value will be implemented as 4 standard-sized vector registers in each wave of the workgroup.) All control flow affects the entire workgroup equally, since the thread of execution is the entire workgroup, so the compiler does not have to do implicit masking. Instead, tile languages usually have provisions for explicit masking using boolean vectors/matrices.
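A sketch of the tile-level view in NumPy (hypothetical shapes; in Triton the explicit-mask step would be something like `tl.where`): the program manipulates a whole 32x32 matrix as one value, and how that value is spread across the workgroup's registers is the compiler's business, not the programmer's.

```python
import numpy as np

# One 32x32 fp32 matrix is a single program value. Per the example
# above: 32*32 = 1024 elements over 8 waves * 32 lanes = 256 lanes,
# i.e. 4 vector registers per lane - but the source never says so.
A = np.ones((32, 32), dtype=np.float32)
B = np.full((32, 32), 2.0, dtype=np.float32)
C = A @ B                        # one whole-tile matrix multiply

# Explicit masking with a boolean matrix instead of implicit
# per-lane masking: keep the upper triangle, zero the rest.
keep = np.triu(np.ones((32, 32), dtype=bool))
C_masked = np.where(keep, C, 0.0)
```

The control flow of this snippet is uniform across the whole tile; selectivity is expressed only through the explicit boolean matrix.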

Tile languages are a new phenomenon and clearly disagree on what the exact level of abstraction should be. For example, Triton mostly hides the details of shared memory from the programmer and lets the compiler take care of software-pipelined loads, while in Tilus here, it looks like the programmer has to manage shared memory explicitly.