aliljet 4 days ago:
Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models. I'd love to load this up on the old 2080ti with 128GB of VRAM and play, even slowly. I'm curious what the current recommendation on that path looks like. Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
giobox 4 days ago:
If you just want to get something running locally as fast as possible to play with (the 2080ti typically had 11GB of VRAM, which will be one of the main limiting factors), the Ollama app will run most of these models locally with minimum user effort. If you really do have a 2080ti with 128GB of VRAM, we'd love to hear more about how you did it!
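A minimal sketch of that path, assuming Ollama is installed and the model (or a close Qwen3 MoE equivalent) is available in the Ollama library; the tag below is illustrative, so check the library for the exact name:

    # Pull a 4-bit quantized 30B-A3B MoE and chat with it. On an 11GB card
    # Ollama keeps what it can in VRAM and offloads the rest to system RAM.
    ollama pull qwen3:30b-a3b
    ollama run qwen3:30b-a3b "Summarize the trade-offs of MoE models for local inference."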
jlokier 4 days ago:
I use a MacBook Pro with 128GB of RAM, "unified memory" that's available to both CPU and GPU. It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and it works well in a coffee shop on battery with no internet connection. I use Ollama to run the models, so I can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
| ||||||||||||||||||||||||||||||||||||||||||||
btbuildem 4 days ago:
I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open-box deals, or the deep discount of the moment. This comfortably fits FP8-quantized 30B models, which seem to be the "top of the line for hobbyists" grade across the board.

- Ryzen 9 9950X
- MSI MPG X670E Carbon
- 96GB RAM
- 2x RTX 3090 (24GB VRAM each)
- 1600W PSU
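For reference, a rough sketch of how a ~30B FP8 checkpoint could be served across the two 3090s with vLLM; the model id is a placeholder, and on Ampere cards vLLM handles FP8 weights via a weight-only fallback, so treat the flags as a starting point rather than a recipe:

    # Serve an FP8-quantized ~30B model split across both 24GB cards.
    # --tensor-parallel-size 2 shards the model over the two GPUs;
    # --max-model-len caps context so the KV cache also fits in VRAM.
    vllm serve Qwen/Qwen3-30B-A3B-FP8 \
      --tensor-parallel-size 2 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.90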
| ||||||||||||||||||||||||||||||||||||||||||||
jwr 4 days ago:
I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for the extra performance cores, recently also for the GPU power) with 64GB of RAM. My next upgrade will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs at the same time.
Lapel2742 4 days ago:
> I'd love to load this up on the old 2080ti with 128GB of VRAM and play, even slowly.

I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. If you don't expect 60 tps, such models should run sufficiently fast. I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) and a Radeon 6600 with 8GB. According to a quick search your card is faster than that, and I get ~12 tps with 16k context on llama.cpp, which is OK for playing around.

My Radeon (ROCm) specific batch file to start this:

    llama-server --ctx-size 16384 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --device ROCm0 -ngl -1 --model /usr/local/share/gguf/Qwen3-30B-A3B-Q4_0.gguf --cache-ram 16384 --cpu-moe --numa distribute --override-tensor "\.ffn_.*_exps\.weight=CPU" --jinja --temp 0.7 --port 8080
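Once llama-server is up on port 8080 it exposes an OpenAI-compatible API, so a quick sanity check from another terminal looks something like this (just a sketch; prompt and settings are arbitrary):

    # Hit the OpenAI-compatible chat endpoint served by llama-server.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "temperature": 0.7}'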
exe34 4 days ago:
llama.cpp + quantized: https://huggingface.co/bartowski/Alibaba-NLP_Tongyi-DeepRese...

Get the biggest one that will fit in your VRAM.
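A minimal sketch of that route, assuming a reasonably recent llama.cpp build (the repo:quant tag below is a placeholder; substitute the bartowski repo from the link above):

    # llama.cpp can pull GGUF quants straight from Hugging Face with -hf.
    # Pick the largest quant tag that fits in VRAM, then offload all layers.
    llama-cli -hf bartowski/SOME-MODEL-GGUF:Q4_K_M \
      -ngl 99 \
      --ctx-size 16384 \
      -p "Hello"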
| ||||||||||||||||||||||||||||||||||||||||||||
greggh 3 days ago:
If you really need a lot of cheap VRAM, ROCm still supports the AMD MI50, and you can get 32GB versions of the MI50 on Alibaba/AliExpress for around $150-$250 each. A few people on r/localllama have shown setups with multiple MI50s running 128GB of VRAM and doing a decent job with large models. Obviously it won't run as fast as brand-new GPUs because of memory bandwidth and a few other things, but it's more than fast enough to be usable. This can end up getting you 128GB of VRAM for under $1000.
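A rough sketch of what serving a large GGUF across four 32GB MI50s might look like with a ROCm build of llama.cpp (model path and split ratios are illustrative):

    # Spread the model across four MI50s.
    # --split-mode layer distributes whole layers across devices;
    # --tensor-split sets the proportion each GPU takes.
    llama-server -m /models/some-large-model-Q4_K_M.gguf \
      --split-mode layer \
      --tensor-split 1,1,1,1 \
      -ngl 99 \
      --ctx-size 8192 \
      --port 8080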
homarp 4 days ago:
llama.cpp gives you the most control to tune it for your machine.
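For example, these are the usual knobs to experiment with (values here are illustrative, not recommendations):

    # Typical levers: GPU layer offload, context size, batch size,
    # CPU threads, and KV-cache quantization.
    llama-server -m model.gguf \
      -ngl 24 \
      --ctx-size 8192 \
      --batch-size 512 \
      --threads 8 \
      --cache-type-k q8_0 --cache-type-v q8_0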
CuriousSkeptic 4 days ago:
I'm sure this guy has some helpful hints on that: https://youtube.com/@azisk
sumo43 4 days ago:
Try running this using their harness: https://huggingface.co/flashresearch/FlashResearch-4B-Thinki...
aliljet 4 days ago:
Oh my god. 128GB of RAM! Way too late to repair this thread, but most people caught this.
sigmarule 4 days ago:
The Framework Desktop runs this perfectly well, and for just about $2k.
3abiton 4 days ago:
As many have pointed out, Macs are decent enough to run them (with maxed-out RAM). You also have more alternatives, like the DGX Spark (if you appreciate the ease of CUDA, albeit with a tad slower token generation) or the Strix Halo (good luck with ROCm though, AMD is still peddling hype). There is no straightforward "cheap" answer. You either go big (GPU server) or compromise. Either way, use vLLM, SGLang, or llama.cpp; Ollama is just inferior in every way to llama.cpp.