ckastner 4 days ago

A lot of this coincides with experiments I did on passing consumer AMD GPUs through to VMs [1], a setup the Debian ROCm Team uses in its CI.

The Debian package rocm-qemu-support ships scripts that facilitate most of this. I've since generalized this by adding NVIDIA support, but I haven't uploaded the new gpuisol-qemu package [2] to the official Archive yet. It still needs some polishing.

Just dumping this here to add more references (especially the further reading section; the Gentoo and Arch wikis had a lot of helpful data).

[1]: https://salsa.debian.org/rocm-team/community/team-project/-/...

[2]: https://salsa.debian.org/ckk/gpu-isolation-tools

latchkey 4 days ago | parent [-]

A couple of relevant open issues here:

https://github.com/amd/MxGPU-Virtualization/issues/6

https://github.com/amd/MxGPU-Virtualization/issues/16

ckastner 4 days ago | parent [-]

Coincidentally, the first issue (referencing Navi 21) was the one I started these experiments with, and this turned out to be pretty informative.

Our Navi 21 would almost always go AWOL after a test run had completed, requiring a full reboot. At some point, I noticed that this only happened when our test runner was driving the test; I never had an issue when testing interactively. I eventually realized that our test driver was simply killing the VM when the test was done, which is fine for a CPU-based test but left the GPU in a bad state. When working interactively, I was always shutting down the VM cleanly, which apparently avoided the problem. Patching our test runner to shut down VMs cleanly fixed this.
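For illustration, the "shut down cleanly instead of killing" idea could look roughly like the sketch below. This is not the actual patch to our test runner. The names `qmp_powerdown` and `shutdown_vm` are made up for this example, and it assumes QEMU was started with a QMP UNIX socket (for instance via `-qmp unix:<path>,server=on,wait=off`). It asks the guest to power down through the QMP `system_powerdown` command and only kills the process if the guest does not exit in time.

```python
import json
import socket
import subprocess
from typing import Callable


def qmp_powerdown(sock_path: str) -> None:
    """Ask a QEMU guest to power down (ACPI) over a QMP UNIX socket.

    Assumption: sock_path is the socket QEMU was told to listen on.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        chan = sock.makefile("rw")
        chan.readline()  # QMP greeting banner
        # QMP requires capability negotiation before any other command.
        for command in ("qmp_capabilities", "system_powerdown"):
            chan.write(json.dumps({"execute": command}) + "\n")
            chan.flush()
            chan.readline()  # next message from QEMU (normally the reply)


def shutdown_vm(
    proc: subprocess.Popen,
    request_powerdown: Callable[[], None],
    timeout: float = 60.0,
) -> bool:
    """Return True if the VM process exited on its own, False if it was killed."""
    if proc.poll() is not None:
        return True  # already gone
    request_powerdown()
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # Last resort: this is the hard kill that upset the GPU's state.
        proc.kill()
        proc.wait()
        return False
```

A runner would then call something like `shutdown_vm(qemu_proc, lambda: qmp_powerdown("/run/vm.qmp"))` (path hypothetical) and could flag the GPU as suspect whenever the result is `False`.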

And I've had no luck with iGPUs, as referenced by the second issue.

From what I understand, I don't think that consumer AMD GPUs can/will ever be fully supported, because the GPU reset mechanisms of older cards are so complex. That's why things like vendor-reset [3] exist, which apparently duplicate a lot of the in-kernel driver code but ultimately only twiddle some bits.

[3]: https://github.com/gnif/vendor-reset