sfifs a day ago

A lot depends on your specific use case, but mid-sized open-weight models are actually pretty good now, so this is realistic [1].

The first question to ask is whether your use case requires handling personal or sensitive data.

If you're using the LLM for OpenClaw or you want to handle sensitive or medical data, a local model is generally necessary.

If it's not so sensitive, cloud providers with some sort of user-agreement guarantee that they won't use your data for training would be the next best bet. I personally generally use Gemini or Sonnet as my cloud backup. As I understand it, OpenAI, Cloudflare (which bought Replicate) and Qwen also seem to provide such guarantees and make SOTA models available. Others like DeepSeek seem to have an opt-out setting. I avoid OpenRouter & co except for benchmarking models with public or dummy data, as there is absolutely zero guarantee, or ability to enforce terms, on the providers your data might be sent to.

Gemini and Anthropic (and OpenAI) tend to be expensive - it's very easy to run up bills of 15 dollars a day or so, which puts you solidly in 1-year-payback-on-a-Mac-Mini territory. At this point I decided to buy. Gemini Flash Lite 3.1 is, however, surprisingly good value.
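For anyone who wants to redo the payback arithmetic with their own numbers, here is a minimal sketch. The hardware price below is a placeholder I made up for illustration, not a quoted price for any machine; only the 15 dollars a day comes from the comment above. It also ignores electricity and resale value.

```python
def payback_days(hardware_cost: float, daily_api_spend: float) -> float:
    """Days of API spend needed to equal a one-off hardware purchase."""
    if daily_api_spend <= 0:
        raise ValueError("daily_api_spend must be positive")
    return hardware_cost / daily_api_spend


# ASSUMPTION: hypothetical hardware price, pick your own.
hardware_cost = 5000.0
# From the comment above: roughly 15 dollars a day of API usage.
daily_api_spend = 15.0

days = payback_days(hardware_cost, daily_api_spend)
print(f"Break-even after about {days:.0f} days ({days / 365:.2f} years)")
print(f"One year of API spend at this rate: {daily_api_spend * 365:.0f} dollars")
```

At 15 dollars a day the yearly API spend is 5,475 dollars, so any machine under that price pays for itself inside a year if it actually replaces the API usage.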

The next question is Mac or CUDA. If your expected use is serving LLMs for inference, the latest large-memory Macs give pretty good inference speed (better than DGX Spark) at a reasonable cost - I think they offer much better value than CUDA if the only use case is LLM inference & harnesses.

If you also plan to fine-tune models, experiment with other types of ML on GPUs, do computer vision stuff etc., the development tooling on CUDA is far ahead of all other platforms.

If you choose CUDA, the next question is GB10 family (DGX Spark - clusterable, with 128GB RAM et al.) or dedicated GPU workstations. What I found is that practically any serious model weighs in at requiring at least 96GB VRAM - Antirez's 2-bit quant of DeepSeek 4 Flash (my current daily driver) [2], the Qwen 3.5 122B A10B 4-bit quant, the Qwen 3.6 27B Dense and 35B A3B 8-bit quants etc. So you're well out of consumer GPU territory and into 1 or more RTX 6000 Pros or data-center-grade devices. Yes, you can try to hack away with multiple consumer cards or SSD streaming, but it's very fiddly and you probably have better things to do with your life.
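A rough way to sanity-check memory needs is parameters times bits per weight, divided by 8. The sketch below does only that, for the Qwen quants named above. It is a lower bound on weights alone: KV cache, context length, quantization metadata and runtime overhead all come on top, so it will not by itself reproduce the 96GB figure, which is my practical experience rather than a formula. The function name is just something I picked for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Excludes KV cache, activations and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


# (parameters in billions, bits per weight), taken from the model names above.
models = {
    "Qwen 3.5 122B A10B, 4-bit": (122, 4),
    "Qwen 3.6 27B Dense, 8-bit": (27, 8),
    "Qwen 3.6 35B A3B, 8-bit": (35, 8),
}

for name, (params_billion, bits) in models.items():
    size = weight_memory_gb(params_billion, bits)
    print(f"{name}: ~{size:.0f} GB for weights, before KV cache and overhead")
```

Note that for MoE models (the A10B / A3B ones) all the experts still have to sit in memory, so the total parameter count is what matters for sizing, not the active parameter count.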

The GB10 system - which I ultimately went with - is certainly much cheaper and can be clustered through the special NVLink cable to get 256, 384 or 512 GB setups, but comes with severely constrained bandwidth. The Pro GPUs blow these out of the water on performance but are expensive.

Lastly, renting a cloud GPU machine doesn't really make sense except to run already-debugged fine-tuning workloads. You'll probably spend at least 4 dollars an hour for sufficient capacity, which, for personal use, will mostly sit idle.
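To put a number on the idle-time problem, here is a small sketch. The 4 dollars an hour is from the paragraph above; the utilization fractions are made-up examples, so plug in your own.

```python
def cost_per_used_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of each hour you actually use an always-on rental.

    utilization is the fraction of rented time doing useful work (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


hourly_rate = 4.0  # dollars per hour, from the comment above

print(f"Always-on cost per day: {hourly_rate * 24:.0f} dollars")
# ASSUMPTION: illustrative utilization levels, not measurements.
for utilization in (1.0, 0.5, 0.25):
    effective = cost_per_used_hour(hourly_rate, utilization)
    print(f"{utilization:.0%} busy -> {effective:.2f} dollars per useful hour")
```

You can of course spin the machine up and down to avoid paying for idle hours, but for interactive personal use that quickly gets as fiddly as the multi-card hacks.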

1. https://srinathh.medium.com/mid-size-local-models-are-now-co...

2. https://github.com/antirez/ds4