See also: https://tinygrad.org/
Reverse-engineered, Python-only GPU API; works not only with CUDA but also with AMD's ROCm
Other runtimes: https://docs.tinygrad.org/runtime/#runtimes