What happens when you run a CUDA kernel?

orliesaurus 35 minutes ago | parent | next [-]

There are companies whose whole job right now is optimizing kernels so that things run faster. I wonder whether those companies will be dethroned by an open source library that does this really well (I bet Nvidia could release one any day), or whether they will thrive and be acquired by the big providers as a "moat" to speed up their inference.

	spmurrayzzz a minute ago | parent [-]
		Near-term acquihires are a likely bet, I think. But given model progress on related benchmarks like KernelBench [1], I do think a set of more commoditized solutions is also inevitable. The caveat is that each new generation of hardware often comes with brand new constraints and features that a given generation of models hasn't seen before (e.g. tcgen05 in Blackwell was out of distribution at one point). As the models start to generalize better this might stop being a showstopper, but it is still an issue for now. [1] https://kernelbench.com/

fooblaster 2 hours ago | parent | prev | next [-]

The hardware has some open documentation. You don't actually need to read the kernel source to find some of the method documentation or qmd formats. See https://github.com/NVIDIA/open-gpu-doc/blob/master/classes/c...

einpoklum 2 hours ago | parent | prev [-]

First, nice writeup; it goes into a lot of nooks and crannies.

That said, a lot of the user-space "voodoo" goes away if you don't go through CUDA's "runtime API". If you use the driver API, take your kernel source as a string, and compile it with NVIDIA's run-time compiler (NVRTC), you'll have better visibility into a lot (not all) of what's going on. For the "raw" version of this, look at:

https://github.com/NVIDIA/cuda-samples/tree/master/cpp/0_Int...

but for a much more readable, and still fully transparent modern-C++ API version of the same, try this:

https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...

that's a sample program for my CUDA API wrappers (header-only) library.
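
For anyone who doesn't want to click through, here is a rough sketch of the driver-API-plus-NVRTC flow described above. It is not taken from either linked sample. The kernel (`scale`), the array size and the error-check macros are made up for illustration, and it assumes a machine with the CUDA toolkit installed and linking against `nvrtc` and `cuda`.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Illustrative error-check macros (not part of CUDA): bail out on any failure.
#define CU_CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) { \
    const char* m_ = nullptr; cuGetErrorString(r_, &m_); \
    std::fprintf(stderr, "%s failed: %s\n", #call, m_ ? m_ : "unknown"); \
    std::exit(1); } } while (0)
#define NVRTC_CHECK(call) do { nvrtcResult r_ = (call); if (r_ != NVRTC_SUCCESS) { \
    std::fprintf(stderr, "%s failed: %s\n", #call, nvrtcGetErrorString(r_)); \
    std::exit(1); } } while (0)

// Kernel source held as a plain string. extern "C" keeps the name unmangled
// so it can be looked up as "scale" after the module is loaded.
static const char* kSource = R"(
extern "C" __global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
)";

int main() {
    // 1. Compile the string to PTX with NVRTC (default options).
    nvrtcProgram prog;
    NVRTC_CHECK(nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr));
    if (nvrtcCompileProgram(prog, 0, nullptr) != NVRTC_SUCCESS) {
        size_t logSize = 0;
        NVRTC_CHECK(nvrtcGetProgramLogSize(prog, &logSize));
        std::string log(logSize, '\0');
        NVRTC_CHECK(nvrtcGetProgramLog(prog, log.data()));
        std::fprintf(stderr, "compile error:\n%s\n", log.c_str());
        return 1;
    }
    size_t ptxSize = 0;
    NVRTC_CHECK(nvrtcGetPTXSize(prog, &ptxSize));
    std::string ptx(ptxSize, '\0');
    NVRTC_CHECK(nvrtcGetPTX(prog, ptx.data()));
    NVRTC_CHECK(nvrtcDestroyProgram(&prog));

    // 2. Everything the runtime API does implicitly is spelled out here:
    //    init, device, context, module load, function lookup.
    CU_CHECK(cuInit(0));
    CUdevice dev;
    CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CU_CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CU_CHECK(cuCtxSetCurrent(ctx));
    CUmodule mod;
    CU_CHECK(cuModuleLoadData(&mod, ptx.c_str()));
    CUfunction fn;
    CU_CHECK(cuModuleGetFunction(&fn, mod, "scale"));

    // 3. Allocate, copy, launch, copy back.
    int n = 1024;
    float factor = 2.0f;
    std::vector<float> host(n, 1.0f);
    const size_t bytes = n * sizeof(float);
    CUdeviceptr dptr;
    CU_CHECK(cuMemAlloc(&dptr, bytes));
    CU_CHECK(cuMemcpyHtoD(dptr, host.data(), bytes));

    void* args[] = { &dptr, &factor, &n };  // one pointer per kernel parameter
    const unsigned block = 256;
    const unsigned grid = (n + block - 1) / block;
    CU_CHECK(cuLaunchKernel(fn, grid, 1, 1, block, 1, 1,
                            0 /* dynamic shared mem */, nullptr /* default stream */,
                            args, nullptr));
    CU_CHECK(cuCtxSynchronize());
    CU_CHECK(cuMemcpyDtoH(host.data(), dptr, bytes));
    std::printf("host[0] after launch: %f\n", host[0]);

    // 4. Clean up explicitly.
    CU_CHECK(cuMemFree(dptr));
    CU_CHECK(cuModuleUnload(mod));
    CU_CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

The point is that each step the runtime API hides (context creation, module loading, argument marshalling) is an explicit call here, so you can see what is going on at each stage.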

	mschuetz an hour ago | parent [-]
		I like the driver API because it allows treating CUDA kernels like hot-reloadable shaders. It's fun to develop while being able to change the code at runtime.
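
A minimal sketch of how the hot-reload loop mentioned above might look, assuming the kernel lives in its own source file. `KernelReloader` is a hypothetical helper, not part of CUDA or any library named in this thread. The actual rebuild (presumably NVRTC compilation followed by `cuModuleLoadData` and `cuModuleGetFunction`) is left to a callback, so the sketch itself only needs the C++17 standard library.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical helper: remembers a kernel file's last write time and hands
// the file contents to `rebuild` whenever that timestamp changes.
class KernelReloader {
public:
    // In a real program this callback would recompile the source and swap in
    // the new module/function handle (an assumption; it is stubbed out here).
    using Rebuild = std::function<bool(const std::string& source)>;

    KernelReloader(fs::path path, Rebuild rebuild)
        : path_(std::move(path)), rebuild_(std::move(rebuild)) {}

    // Call once per frame or iteration. Returns true only when the file
    // changed since the last poll and the rebuild callback reported success.
    bool poll() {
        std::error_code ec;
        const auto stamp = fs::last_write_time(path_, ec);
        if (ec) return false;                               // missing or mid-save
        if (loaded_ && stamp == last_stamp_) return false;  // unchanged

        std::ifstream in(path_);
        if (!in) return false;
        std::string source((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

        // Record the timestamp even if the rebuild fails, so broken source
        // is not recompiled on every poll until the file is saved again.
        last_stamp_ = stamp;
        loaded_ = true;
        return rebuild_(source);
    }

private:
    fs::path path_;
    Rebuild rebuild_;
    fs::file_time_type last_stamp_{};
    bool loaded_ = false;
};
```

You would construct one `KernelReloader` per kernel file and call `poll()` at the top of the render or simulation loop. Keeping the previous module alive when the callback returns false means a typo in the kernel doesn't take the running program down.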