villgax 6 hours ago

That’s always been possible with the MPS backend; the reason people choose to omit it in HF Spaces/demos is that HF doesn’t offer an MPS backend. People would rather have the thing run at full speed than 10x slower just for compatibility.

shivampkumar 4 hours ago | parent | next [-]

IMO TRELLIS.2 is a slightly different case from the usual HF models scenario. It depends on five compiled CUDA-only extensions -- flex_gemm for sparse convolution, flash_attn, o_voxel for CUDA hashmap ops, cumesh for mesh processing, and nvdiffrast for differentiable rasterization. These aren't PyTorch ops that fall back to MPS -- they're custom C++/CUDA kernels. The upstream setup.sh literally exits with "No supported GPU found" if nvidia-smi isn't present. The only reason I picked this up was that I thought it was cool, and no one was working on the open issue for Apple Silicon back then (github.com/microsoft/TRELLIS.2/issues/74) requesting non-CUDA support.
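For anyone curious, the core of a port like this is replacing the hard nvidia-smi/CUDA check with a device-selection fallback. A minimal PyTorch sketch (this is an illustrative pattern, not TRELLIS.2's actual code; `pick_device` is a hypothetical helper name):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is available on Apple Silicon builds of PyTorch
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
# Tensors created on this device run through whichever backend was found,
# but compiled C++/CUDA extensions still have to be reimplemented in
# pure PyTorch ops for the MPS/CPU paths to work at all.
x = torch.randn(4, 4, device=device)
print(device.type)
```

The hard part isn't this dispatch, of course; it's rewriting each custom kernel (sparse conv, hashmap ops, rasterization) as pure-PyTorch ops that the MPS backend can execute, which is where the slowdown comes from.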

Reubend 5 hours ago | parent | prev | next [-]

Are you saying the original one worked with MPS? Or are you just saying it was always theoretically possible to build what OP posted?

refulgentis 6 hours ago | parent | prev [-]

It’s always been possible, but it’s not possible because there’s no backend, and no one wants it to be possible because everyone needs 10x the speed of running on a Mac? I’m missing something, I think.

shivampkumar 4 hours ago | parent [-]

I thought it was cool, and then I found the open issue mentioned above, which convinced me it's definitely something more people want.

It IS significantly slower, about 3.5 minutes on my MacBook vs seconds on an H100. That's partly the pure-PyTorch backend overhead and partly just the hardware difference.

For my use case the tradeoff works -- iterate locally without paying for cloud GPUs or waiting in queues.