princehonest 17 hours ago

Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?

clusterhacks 15 hours ago | parent | next [-]

All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.

I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I were doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.

For grins:

Max t/s for this and smaller models? RTX 5090 system. Barely squeezes in at $5,000 today, and given RAM prices, maybe not actually possible tomorrow.

Max CUDA compatibility, slower t/s? DGX Spark.

OK with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128 GB of unified memory; order a Framework Desktop.

Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth, and Mac users seem quite happy running locally for just messing around.

You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.

kpw94 14 hours ago | parent [-]

> I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system.

That's a good idea!

Curious about this, if you don't mind sharing:

- what's the stack? (Do you run something like llama.cpp on that rented machine?)

- what model(s) do you run there?

- what's your rough monthly cost? (Does it come out much cheaper than calling the equivalent paid APIs?)

clusterhacks 12 hours ago | parent [-]

I ran Ollama first because it was easy, but now I download the source and build llama.cpp on the machine. I don't bother saving a filesystem between runs on the rented machine; I build llama.cpp fresh every time I start up.
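Roughly, the per-boot routine is just the following (a condensed sketch; my full command history is in a reply further down, adjust the -j count and model to taste):

  # fresh rented instance each time: install the one missing build dep, build with CUDA, serve a model
  sudo apt-get install -y libcurl4-openssl-dev
  git clone https://github.com/ggml-org/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_CUDA=ON
  cmake --build build --config Release -j 16
  ./build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja --port 11434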

I am usually just running gpt-oss-120b or one of the Qwen models. Sometimes Gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on a single 80-ish GB GPU, because those are cheap.

I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.

Juminuvi 9 hours ago | parent | next [-]

I know you say you don't use the paid APIs, but renting a GPU is something I've been thinking about, and I'd be really interested in knowing how this compares with paying by the token. I think gpt-oss-120b is $0.10/input and $0.60/output per million tokens on Azure. In my head this could go a long way, but I haven't used gpt-oss agentically long enough to really understand usage. Just wondering if you know/would be willing to share your typical usage/token spend on that dedicated hardware?
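As a rough yardstick at those quoted rates (taking them at face value, and ignoring caching discounts), the $10-$20/week mentioned above buys a lot of tokens:

  # quoted Azure rates: $0.10 per million input tokens, $0.60 per million output tokens
  # e.g. 100M input + 5M output tokens:
  echo "100 * 0.10 + 5 * 0.60" | bc -l   # = 13.00 USD, about one week of the GPU rental above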

KronisLV 38 minutes ago | parent [-]

For comparison, here's my own usage with various cloud models for development:

  * Claude in December: 91 million tokens in, 750k out
  * Codex in December: 43 million tokens in, 351k out
  * Cerebras in December: 41 million tokens in, 301k out
  * (obviously those figures above are so far in the month only)
  * Claude in November: 196 million tokens in, 1.8 million out
  * Codex in November: 214 million tokens in, 4 million out
  * Cerebras in November: 131 million tokens in, 1.6 million out
  * Claude in October: 5 million tokens in, 79k out
  * Codex in October: 119 million tokens in, 3.1 million out
As for Cerebras in October, I don't have the data because they no longer show stats for the deprecated Qwen3 Coder model, but it was way more: https://blog.kronis.dev/blog/i-blew-through-24-million-token...

In general, I'd say that for the stuff I do my workloads are extremely read-heavy (referencing existing code, patterns, tests, build and check script output, implementation plans, docs, etc.), but it goes about like this:

  * most fixed cloud subscriptions will run out really quickly and will be insufficient (Cerebras being an exception)
  * if paying per token, you *really* want the provider to support proper caching, otherwise you'll go broke (rough numbers below)
  * if you have local hardware, that is great, but it will *never* compete with the cloud models; your best bet is to run something good enough to cover all of your autocomplete needs. With tools like KiloCode, an advanced cloud model can do the planning, a simpler local model the implementation, and the cloud model can then validate the output
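To put rough numbers on the caching point, take the November Claude token volumes above and, purely for illustration, the gpt-oss Azure prices quoted earlier plus an assumed cache-read rate of one tenth the normal input price (actual discounts vary a lot by provider):

  # 196M input tokens, 1.8M output tokens
  echo "196 * 0.10 + 1.8 * 0.60" | bc -l                            # no caching: 20.68 USD
  echo "196 * 0.9 * 0.01 + 196 * 0.1 * 0.10 + 1.8 * 0.60" | bc -l   # 90% cache hits: about 4.80 USD
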
bigiain 10 hours ago | parent | prev [-]

I don't suppose you have (or would be interested in writing) a blog post about how you set that up? Or maybe a list of links/resources/prompts you used to learn how to get there?

clusterhacks 9 hours ago | parent [-]

No, I don't blog. But I just followed the docs for starting an instance on lambda.ai and the llama.cpp build instructions. Both are pretty good resources. I had already set up an SSH key with Lambda, and the Lambda OS images are Linux pre-loaded with CUDA libraries on startup.

Here are my lazy notes + a snippet of the history file from the remote instance for a recent setup where I used the web chat interface built into llama.cpp.

I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.

Connected from a terminal on my box at home and set up the SSH tunnel (local port 22434 forwards to port 11434 on the rented machine, where llama-server will listen):

ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>

  Started building llama.cpp from source, history:    
     21  git clone   https://github.com/ggml-org/llama.cpp
     22  cd llama.cpp
     23  which cmake
     24  sudo apt list | grep libcurl
     25  sudo apt-get install libcurl4-openssl-dev
     26  cmake -B build -DGGML_CUDA=ON
     27  cmake --build build --config Release 
MISTAKE on 27: single-threaded and slow to build; see -j 16 below for a faster build

     28  cmake --build build --config Release -j 16
     29  ls
     30  ls build
     31  find . -name "llama.server"
     32  find . -name "llama"
     33  ls build/bin/
     34  cd build/bin/
     35  ls
     36  ./llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja
MISTAKE, didn't specify the port number for the llama-server

     37  clear;history
     38  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking -c 0 --jinja --port 11434
     39  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking.gguf -c 0 --jinja --port 11434
     40  ./llama-server -hf Qwen/Qwen3-VL-30B-A3B-Thinking-GGUF -c 0 --jinja --port 11434
     41  clear;history
I switched to Qwen3-VL because I needed a multimodal model for that day's experiment. Lines 38 and 39 show me not using the right name for the model. I like how llama.cpp can download and run models directly off of Hugging Face.

Then pointed my browser at http://localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an OpenAI API-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.
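The same tunnel also works if you want the API instead of the web UI; something like this (the model field value is arbitrary here, since llama-server just serves whatever it loaded):

  curl http://localhost:22434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "messages": [{"role": "user", "content": "Summarize this document."}]}'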

bigiain 4 hours ago | parent [-]

Thanks, much appreciated.

tgtweak 13 hours ago | parent | prev | next [-]

Dual 3090s (24 GB each) on 8x+8x PCIe has been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)

48 GB of VRAM and lots of CUDA cores; hard to beat this value at the moment.
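If you want to confirm the bridge is actually being used rather than falling back to PCIe, recent drivers will show you the topology:

  nvidia-smi topo -m          # should report an NV# link between the two GPUs
  nvidia-smi nvlink --status  # per-link state and bandwidth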

If you want to go even further, you can get an 8x V100 32GB server complete with 512 GB of RAM and NVLink switching for $7,000 USD from UnixSurplus (ebay.com/itm/146589457908), which can run even bigger models with healthy throughput. You would need 240 V power to run that in a home-lab environment, though.

lostmsu 12 hours ago | parent [-]

The V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards running continuously for 3 years is about $12k of electricity).
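Back-of-envelope on that electricity figure, assuming roughly 300 W per card under load and $0.19/kWh (both assumptions):

  # 8 cards x 0.3 kW, 24/7 for 3 years
  echo "8 * 0.3 * 24 * 365 * 3 * 0.19" | bc -l   # about 11984 USD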

monster_truck 16 hours ago | parent | prev [-]

I'd throw a 7900 XTX in an AM4 rig with 128 GB of DDR4 (which is what I've been using for the past two years).

Fuck nvidia

clusterhacks 15 hours ago | parent | next [-]

You know, I haven't even been thinking about those AMD GPUs for local LLMs, and it is clearly a blind spot for me.

How is it? I'd guess a bunch of the MoE models actually run well?

stusmall 12 hours ago | parent [-]

I've been running local models on an AMD 7800 XT with ollama-rocm. I've had zero technical issues. It's really just that the usefulness of a model that fits in only 16 GB of VRAM + 64 GB of main RAM is questionable, but that isn't an AMD-specific issue. It was a similar experience running locally with an Nvidia card.
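For what it's worth, the workflow looks the same as on Nvidia once the ROCm build of Ollama is installed; a minimal sketch (the model tag is just an example that fits comfortably in 16 GB):

  ollama pull qwen2.5-coder:14b
  ollama run qwen2.5-coder:14b "write fizzbuzz in python"
  ollama ps   # shows whether the loaded model is resident on the GPU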

androiddrew 16 hours ago | parent | prev [-]

Get a Radeon AI Pro R9700! 32 GB of VRAM.