The performance purist don't use Cuda either though (that's why Deepseek used PTX directly).

Everything is an abstraction and choosing the right level of abstraction for your usecase is a tradeoff between your engineering capacities and your performance needs.

▲ LowLevelMahn 6 days ago | parent | next [-]

this Rust demo also uses PTX directly

  During the build, build.rs uses rustc_codegen_nvvm to compile the GPU kernel to PTX.
  The resulting PTX is embedded into the CPU binary as static data.
  The host code is compiled normally.

▲

LegNeato 6 days ago | parent | next [-]

To be more technically correct, we compile to NVVM IR and then use NVIDIA's NVVM to convert it to PTX.

▲

saagarjha 5 days ago | parent | prev [-]

That’s not really the same thing; it compiles through PTX rather than using inline assembly.

	▲	LegNeato 18 hours ago \| parent [-]
		FYI, you can drop down into ptx if need be: https://github.com/Rust-GPU/Rust-CUDA/blob/aa7e61512788cc702...

▲ brandonpelfrey 6 days ago | parent | prev [-]

The issue in my mind is that this doesn’t seem to include any of the critical library functionality specific eg to NVIDIA cards, think reduction operations across threads in a warp and similar. Some of those don’t exist in all hardware architectures. We may get to a point where everything could be written in one language but actually leveraging the hardware correctly still requires a bunch of different implementations, ones for each target architecture.

The fact that different hardware has different features is a good thing.

	▲	rbanffy 6 days ago \| parent [-]
		The features missing hardware support can fall back to software implementations. In any case, ideally, the level of abstraction would be higher, with little application logic requiring GPU architecture awareness.