smallnamespace 5 hours ago

It’s architecturally not a good approach. System RAM is much slower, so data that doesn’t need to be accessed often should live there. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.

The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
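The "boring normal code" the comment has in mind is just explicit placement: the application tracks which buffers are hot, keeps those in VRAM, and copies the rest out when space is needed. A minimal sketch of that idea, in Python with hypothetical names (the dict moves stand in for real `cudaMemcpy`/`cudaMemcpyAsync` calls; a real allocator would pick eviction victims from its own access knowledge):

```python
# Hypothetical sketch of application-level memory tiering: the application
# decides which buffers live in the fast tier ("VRAM") and which stay in
# the slow tier ("system RAM"), instead of letting a shim page transparently.

class TieredAllocator:
    def __init__(self, vram_capacity):
        self.vram_capacity = vram_capacity
        self.vram = {}      # name -> size, resident in the fast tier
        self.sysram = {}    # name -> size, resident in the slow tier

    def vram_used(self):
        return sum(self.vram.values())

    def allocate(self, name, size):
        # New buffers start in the slow tier until the app promotes them.
        self.sysram[name] = size

    def promote(self, name):
        # Evict resident buffers until the hot buffer fits.
        size = self.sysram.pop(name)
        while self.vram_used() + size > self.vram_capacity:
            victim, vsize = next(iter(self.vram.items()))
            del self.vram[victim]
            self.sysram[victim] = vsize   # stands in for cudaMemcpy DtoH
        self.vram[name] = size            # stands in for cudaMemcpy HtoD
```

The point of the comment is that only the application knows which buffer is "weights" and which is "activations", so only it can make the promote/evict decision well.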

jbverschoor 11 minutes ago | parent | next [-]

Not true for unified-memory systems. And on Strix Halo you have to dedicate a fixed amount of memory to the GPU, which is annoying.

You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM, which is faster than the rest.

timnetworks 3 hours ago | parent | prev | next [-]

Some people are not concerned with having it run the fastest, just having it run at all may be enough.

m-schuetz 2 hours ago | parent | next [-]

From my experience, accessing system RAM from the GPU is so slow, it might as well count as "does not work". It's orders of magnitude faster to memcpy large swaths of the memory you are going to use over to the GPU, rather than accessing system memory from a kernel, which then takes ages waiting for that small block/page of memory, then waits again for the next small page/block, etc. Latency hiding doesn't work anymore if the latency is that large.
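A back-of-the-envelope model shows why the bulk memcpy wins: you pay a per-transfer latency once per chunk, so tiny chunks are dominated by latency while one big copy is dominated by bandwidth. The numbers below are illustrative assumptions (roughly PCIe 4.0 x16: ~25 GB/s, ~10 µs round-trip latency per transfer), not measurements:

```python
# Cost model: total time to move data in fixed-size chunks over a link
# with per-transfer latency. Numbers are illustrative assumptions.

LATENCY_S = 10e-6          # assumed per-transfer round-trip latency
BANDWIDTH_BPS = 25e9       # assumed sustained bandwidth, bytes/second

def transfer_time(total_bytes, chunk_bytes):
    """Total seconds to move total_bytes in chunk_bytes-sized pieces."""
    chunks = total_bytes // chunk_bytes
    return chunks * (LATENCY_S + chunk_bytes / BANDWIDTH_BPS)

GIB = 1 << 30
bulk  = transfer_time(GIB, GIB)    # one big memcpy up front
paged = transfer_time(GIB, 4096)   # demand-paging 4 KiB at a time

# With these assumptions, paged access comes out tens of times slower
# than the bulk copy: the latency dwarfs each 4 KiB payload, and there
# is not enough independent work in flight to hide it.
```

Under these (assumed) numbers the paged path is roughly 60x slower, which matches the "orders of magnitude" characterization above; real demand-paging shims can overlap some transfers, but the latency term still dominates at page granularity.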

nl 2 hours ago | parent | prev [-]

But then you can use CPU/RAM offload, which already allows you to offload without a kernel module.
