| ▲ | rbanffy 6 days ago |
| I think the idea is to allow developers to write a single implementation and have a portable binary that can run on any kind of hardware. We do that all the time - there is lots of code that chooses optimal code paths depending on the runtime environment or on which ISA extensions are available. |
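A minimal sketch of that CPU-side pattern, assuming plain Rust with std: detect an ISA extension at runtime and dispatch to it, with a portable fallback (the AVX2 helper is illustrative, not from any particular project).

    fn sum(values: &[f32]) -> f32 {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // Safe because we just checked that AVX2 is available on this CPU.
                return unsafe { sum_avx2(values) };
            }
        }
        values.iter().sum() // portable path, works on any architecture
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(values: &[f32]) -> f32 {
        // A real implementation would use core::arch AVX2 intrinsics here;
        // #[target_feature] lets the compiler assume AVX2 inside this function.
        values.iter().sum()
    }

The single-binary GPU story is the same idea one level up: ship one artifact and pick the best available path at run time.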
|
| ▲ | pjmlp 6 days ago | parent | next [-] |
| Without the tooling, though. Commendable effort, but just as people forget that languages are ecosystems, they tend to forget that APIs are ecosystems as well. |
|
| ▲ | kookamamie 6 days ago | parent | prev [-] |
| Sure. The performance-purist in me would be very doubtful about the result's optimality, though. |
| |
| ▲ | littlestymaar 6 days ago | parent [-] | | The performance purist doesn't use CUDA either, though (that's why DeepSeek used PTX directly). Everything is an abstraction, and choosing the right level of abstraction for your use case is a tradeoff between your engineering capacity and your performance needs. | | |
| ▲ | LowLevelMahn 6 days ago | parent | next [-] | | This Rust demo also uses PTX directly:
- During the build, build.rs uses rustc_codegen_nvvm to compile the GPU kernel to PTX.
- The resulting PTX is embedded into the CPU binary as static data.
- The host code is compiled normally.
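A rough sketch of what the host side of that looks like, loosely following the cust crate's published examples; the kernel name "add", its argument list, and the kernel.ptx path are illustrative, not taken from this demo.

    use cust::prelude::*;
    use std::error::Error;

    // build.rs wrote the PTX into OUT_DIR; include_str! bakes it into the binary.
    static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernel.ptx"));

    fn main() -> Result<(), Box<dyn Error>> {
        let _ctx = cust::quick_init()?;            // create a CUDA context
        let module = Module::from_ptx(PTX, &[])?;  // hand the embedded PTX to the driver
        let kernel = module.get_function("add")?;
        let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

        let lhs = DeviceBuffer::from_slice(&[1.0f32; 1024])?;
        let rhs = DeviceBuffer::from_slice(&[2.0f32; 1024])?;
        let out = DeviceBuffer::from_slice(&[0.0f32; 1024])?;

        unsafe {
            cust::launch!(kernel<<<4, 256, 0, stream>>>(
                lhs.as_device_ptr(),
                rhs.as_device_ptr(),
                out.as_device_ptr(),
                out.len()
            ))?;
        }
        stream.synchronize()?;

        let mut host_out = vec![0.0f32; 1024];
        out.copy_to(&mut host_out)?;               // copy results back to the CPU
        Ok(())
    }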
| | |
| ▲ | LegNeato 6 days ago | parent | next [-] | | To be more technically correct, we compile to NVVM IR and then use NVIDIA's NVVM to convert it to PTX. | |
| ▲ | saagarjha 5 days ago | parent | prev [-] | | That’s not really the same thing; it compiles through PTX rather than using inline assembly. | | |
| |
| ▲ | brandonpelfrey 6 days ago | parent | prev [-] | | The issue in my mind is that this doesn’t seem to include any of the critical library functionality specific to, e.g., NVIDIA cards, such as reduction operations across the threads in a warp. Some of those don’t exist in all hardware architectures. We may get to a point where everything can be written in one language, but actually leveraging the hardware correctly still requires a bunch of different implementations, one for each target architecture. The fact that different hardware has different features is a good thing. | | |
| ▲ | rbanffy 6 days ago | parent [-] | | Features that lack hardware support can fall back to software implementations. In any case, ideally, the level of abstraction would be higher, with little application logic requiring GPU architecture awareness. |
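A sketch of what that selection could look like on the host side, assuming the runtime can report device capabilities; the capability flag and kernel names here are hypothetical, not an existing API.

    // Hypothetical capability query result; a real runtime would fill this in
    // from the device (compute capability, subgroup support, etc.).
    struct DeviceCaps {
        has_warp_shuffle: bool, // e.g. NVIDIA warp shuffle / subgroup reductions
    }

    enum ReduceKernel {
        WarpShuffle,        // hardware warp-level reduction where it exists
        TreeInSharedMemory, // portable software fallback that works everywhere
    }

    fn pick_reduce_kernel(caps: &DeviceCaps) -> ReduceKernel {
        if caps.has_warp_shuffle {
            ReduceKernel::WarpShuffle
        } else {
            ReduceKernel::TreeInSharedMemory
        }
    }

    fn main() {
        // In practice this would come from a device query at startup.
        let caps = DeviceCaps { has_warp_shuffle: false };
        match pick_reduce_kernel(&caps) {
            ReduceKernel::WarpShuffle => println!("dispatching warp-shuffle reduction"),
            ReduceKernel::TreeInSharedMemory => println!("dispatching portable tree reduction"),
        }
    }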
|
|
|