alwahi 4 days ago
Okay, I am not a systems-level programmer, but I am currently learning C with the aim of doing some GPGPU programming using CUDA etc. What is a tile-level GPU kernel programming language, and how is it different from something like CUDA? I know I can ask an LLM or search on Google, but I was hoping someone in the community could explain it in a way I could understand.
atq2119 4 days ago
I'd say the main difference is that in traditional GPU languages, the thread of execution is a single lane of a warp or wave. You typically work with ~fp32-sized values, and those are mapped by the compiler to one lane of a 32-wide vector register in a wave (or 16- to 128-wide depending on the architecture). Control flow often has to be implemented through implicit masking, since different threads mapped to lanes of the same vector can make different control flow decisions (that is, an if statement in the source program gets compiled to an instruction sequence that uses masking in some way; the details vary by vendor).

In tile languages, the thread of execution is an entire workgroup (or block in CUDA-speak). You typically work with large vector- or matrix-sized values, and the compiler decides how to distribute those values onto vector registers across the waves of the workgroup. (Example: if your program has a value that is a 32x32 matrix of fp32 elements and a workgroup has 8 32-wide waves, the value will be implemented as 4 standard-sized vector registers in each wave of the workgroup.) All control flow affects the entire workgroup equally, since the thread of execution is the whole workgroup, so the compiler does not have to do implicit masking. Instead, tile languages usually have provisions for explicit masking using boolean vectors/matrices.

Tile languages are a new phenomenon and clearly disagree on what the exact level of abstraction should be. For example, Triton mostly hides the details of shared memory from the programmer and lets the compiler take care of software-pipelined loads, while in Tilus here it looks like the programmer has to manage shared memory explicitly.
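To make that concrete, here is a minimal sketch in Triton (not taken from Tilus; the kernel name and block size are made up for illustration). The kernel body is written once per program, i.e. per workgroup: `offsets` is a whole BLOCK_SIZE-wide vector value, and the tail of the array is handled with an explicit boolean mask rather than implicit per-lane divergence.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_tile_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                             # this program owns one tile
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # a vector of indices, not one index per thread
        mask = offsets < n_elements                             # explicit boolean mask for the last, partial tile
        x = tl.load(x_ptr + offsets, mask=mask)                 # tile-sized load; compiler spreads it over waves/registers
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)           # tile-sized store

    x = torch.rand(10_000, device="cuda")
    y = torch.rand(10_000, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_tile_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)

The equivalent CUDA kernel would compute one scalar index per thread (blockIdx.x * blockDim.x + threadIdx.x) and guard it with an if, which the compiler then lowers to the per-lane masking described above.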
taminka 4 days ago
Programming with tiles means working with contiguous-ish blocks/squares of data to minimise cache misses (often a major bottleneck in GPU programming), so the name is mostly a nod to the fact that optimisations like these are built into the language. See the sketch below.
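As a rough illustration (Triton again, with made-up names): each program loads one BLOCK_M x BLOCK_N tile of a matrix, so for a row-major layout (stride_n = 1) every row of the tile it touches is contiguous in memory and the accesses coalesce.

    import triton
    import triton.language as tl

    @triton.jit
    def copy_tile_kernel(a_ptr, b_ptr, M, N, stride_m, stride_n,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
        pid_m = tl.program_id(0)
        pid_n = tl.program_id(1)
        rows = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        cols = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        offs = rows[:, None] * stride_m + cols[None, :] * stride_n   # 2D block of offsets for one tile
        mask = (rows[:, None] < M) & (cols[None, :] < N)             # handle ragged edges explicitly
        tile = tl.load(a_ptr + offs, mask=mask)                      # one square tile per program
        tl.store(b_ptr + offs, tile, mask=mask)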
torginus 4 days ago
It's like writing code directly for the GPU's DSP-like SIMD cores, instead of taking the CUDA model of targeting a single SIMD thread and letting the compiler figure out how to generate code for the core itself.
socalgal2 3 days ago
Maybe this is my dad speaking through me when he got tired of answering questions, said "look it up", and pointed to our bookshelf, but... copying and pasting your exact words above into an LLM (Gemini/ChatGPT) produced an answer arguably better than any of the human answers at the time of this post.