btown 14 days ago

The GTC 2025 announcement session that's mentioned in this article has video here: https://www.nvidia.com/en-us/on-demand/session/gtc25-s72383/

It's a holistic approach to all levels of the stack, from high-level frameworks to low-level bindings; some of it highlights existing libraries, and some of it is completely new.

One of the big things seems to be a brand-new Tile IR, sitting at the level of PTX, supported by a driver-level JIT compiler, and designed for Python-first semantics via a new cuTile library.

https://x.com/JokerEph/status/1902758983116657112 (without login: https://xcancel.com/JokerEph/status/1902758983116657112 )

Example of proposed syntax: https://pbs.twimg.com/media/GmWqYiXa8AAdrl3?format=jpg&name=...
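
For those who can't load the image: in spirit, the tile style replaces per-thread index arithmetic with operations on whole tiles. A rough sketch of that flavor in Python follows; every name in it (the cutile module, ct.kernel, ct.load, ct.store) is hypothetical and not the actual cuTile API:

    # Hypothetical illustration only: `cutile`, ct.kernel, ct.load, ct.store
    # are made-up names, NOT the confirmed cuTile API.
    import cutile as ct

    @ct.kernel
    def add(x, y, out, TILE: int):
        # Tile semantics: each program instance owns one tile and
        # operates on it as a unit; no per-thread index math.
        pid = ct.program_id(0)
        a = ct.load(x, tile=pid, shape=(TILE,))
        b = ct.load(y, tile=pid, shape=(TILE,))
        ct.store(out, tile=pid, value=a + b)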

Really exciting stuff, though the new IR further widens the gap that projects like https://github.com/vosen/ZLUDA and AMD's own tooling are trying to bridge. But vendor lock-in isn't something we can complain about when it arises from the vendor continuing to push the boundaries of developer experience.

skavi 14 days ago

i’m curious what advantage comes from this existing independently of the PTX stack? i.e., why doesn’t cuTile produce PTX via a bundled compiler, the way Triton or (iirc) Warp does?

Even if there is some impedance mismatch, could PTX itself not have been updated?
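
(For reference, Triton is exactly that model today: tile-level Python that a bundled compiler JITs down to PTX. A minimal, runnable example using Triton's real API:)

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Tile-level view: each program instance handles one block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    # Triton's bundled compiler JITs this to PTX at launch time.
    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    add_kernel[(triton.cdiv(4096, 1024),)](x, y, out, 4096, BLOCK_SIZE=1024)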

cavisne 14 days ago

In the presentation they said kernels will eventually be able to mix SIMT (PTX) and Tile IR, though not at launch. It seems pretty mysterious why they don't just emit PTX; my guess is that they're either taking the opportunity to clean things up for ML tensor-core workloads, or there are hardware-specific features coming that they only want to expose through Tile IR.
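
(For contrast with the tile model, here is the SIMT style in Python, sketched with Numba's real CUDA API: every thread derives its own global index and handles one element, and the compiler still lowers to PTX, via NVVM:)

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        # SIMT view: each thread computes its own global index
        # and handles exactly one element.
        i = cuda.grid(1)
        if i < out.size:
            out[i] = x[i] + y[i]

    n = 4096
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(x)
    # Numba copies the host arrays to the device and emits PTX via NVVM.
    add_kernel[(n + 255) // 256, 256](x, y, out)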

skavi 12 days ago

if i were to lean into cynicism, i might suggest this choice was meant to increase the effort required to reimplement cuda for other cards.