aliljet 4 days ago:
Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models. I'd love to load this up on the old 2080ti with 128GB of VRAM and play, even slowly. I'm curious what the current recommendation on that path looks like. Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
giobox 4 days ago:
If you just want to get something running locally as fast as possible to play with (the 2080ti typically had 11GB of VRAM, which will be one of the main limiting factors), the Ollama app will run most of these models locally with minimum user effort. If you really do have a 2080ti with 128GB of VRAM, we'd love to hear more about how you did it!
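A minimal sketch of that path, assuming Ollama is installed and the model (or a close Qwen3 MoE equivalent) is available in the Ollama library; the tag below is illustrative, so check the library for the exact name:

    # Pull a 4-bit quantized 30B-A3B MoE and chat with it. On an 11GB card
    # Ollama keeps what it can in VRAM and offloads the rest to system RAM.
    ollama pull qwen3:30b-a3b
    ollama run qwen3:30b-a3b "Summarize the trade-offs of MoE models for local inference."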
jlokier 4 days ago:
I use a MacBook Pro with 128GB of RAM, "unified memory" that's available to both CPU and GPU. It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and it works well in a coffee shop on battery with no internet connection. I use Ollama to run the models, so I can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
| ||||||||||||||||||||||||||||||||||||||||||||
btbuildem 4 days ago:
I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open-box deals, or the deep discount of the moment. This comfortably fits FP8-quantized 30B models, which seem to be the "top of the line for hobbyists" grade across the board.

- Ryzen 9 9950X
- MSI MPG X670E Carbon
- 96GB RAM
- 2x RTX 3090 (24GB VRAM each)
- 1600W PSU
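For reference, a rough sketch of how a ~30B FP8 checkpoint could be served across the two 3090s with vLLM; the model id is a placeholder, and on Ampere cards vLLM handles FP8 weights via a weight-only fallback, so treat the flags as a starting point rather than a recipe:

    # Serve an FP8-quantized ~30B model split across both 24GB cards.
    # --tensor-parallel-size 2 shards the model over the two GPUs;
    # --max-model-len caps context so the KV cache also fits in VRAM.
    vllm serve Qwen/Qwen3-30B-A3B-FP8 \
      --tensor-parallel-size 2 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.90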
| ||||||||||||||||||||||||||||||||||||||||||||
jwr 4 days ago:
I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for the extra performance cores, recently also for the GPU power) with 64GB of RAM. My next upgrade will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs at the same time.
Lapel2742 4 days ago:
> I'd love to load this up on the old 2080ti with 128GB of VRAM and play, even slowly.

I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. If you don't expect 60 tps, such models should run sufficiently fast. I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) and a Radeon 6600 with 8GB. According to a quick search your card is faster than that, and I get ~12 tps with 16k context on llama.cpp, which is OK for playing around.

My Radeon (ROCm) specific batch file to start this:

    llama-server --ctx-size 16384 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --device ROCm0 -ngl -1 --model /usr/local/share/gguf/Qwen3-30B-A3B-Q4_0.gguf --cache-ram 16384 --cpu-moe --numa distribute --override-tensor "\.ffn_.*_exps\.weight=CPU" --jinja --temp 0.7 --port 8080
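Once llama-server is up on port 8080 it exposes an OpenAI-compatible API, so a quick sanity check from another terminal looks something like this (just a sketch; prompt and settings are arbitrary):

    # Hit the OpenAI-compatible chat endpoint served by llama-server.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "temperature": 0.7}'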
exe34 4 days ago:
llama.cpp + quantized: https://huggingface.co/bartowski/Alibaba-NLP_Tongyi-DeepRese...

Get the biggest one that will fit in your VRAM.
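A minimal sketch of that route, assuming a reasonably recent llama.cpp build (the repo:quant tag below is a placeholder; substitute the bartowski repo from the link above):

    # llama.cpp can pull GGUF quants straight from Hugging Face with -hf.
    # Pick the largest quant tag that fits in VRAM, then offload all layers.
    llama-cli -hf bartowski/SOME-MODEL-GGUF:Q4_K_M \
      -ngl 99 \
      --ctx-size 16384 \
      -p "Hello"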
| ||||||||||||||||||||||||||||||||||||||||||||
greggh 3 days ago:
If you really need a lot of cheap VRAM, ROCm still supports the AMD MI50, and you can get 32GB versions of the MI50 on Alibaba/AliExpress for around $150-$250 each. A few people on r/localllama have shown setups with multiple MI50s running 128GB of VRAM and doing a decent job with large models. Obviously it won't run as fast as brand-new GPUs because of memory bandwidth and a few other things, but it's more than fast enough to be usable. This can end up getting you 128GB of VRAM for under $1000.
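A rough sketch of what serving a large GGUF across four 32GB MI50s might look like with a ROCm build of llama.cpp (model path and split ratios are illustrative):

    # Spread the model across four MI50s.
    # --split-mode layer distributes whole layers across devices;
    # --tensor-split sets the proportion each GPU takes.
    llama-server -m /models/some-large-model-Q4_K_M.gguf \
      --split-mode layer \
      --tensor-split 1,1,1,1 \
      -ngl 99 \
      --ctx-size 8192 \
      --port 8080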
homarp 4 days ago:
llama.cpp gives you the most control to tune it for your machine.
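For example, these are the usual knobs to experiment with (values here are illustrative, not recommendations):

    # Typical levers: GPU layer offload, context size, batch size,
    # CPU threads, and KV-cache quantization.
    llama-server -m model.gguf \
      -ngl 24 \
      --ctx-size 8192 \
      --batch-size 512 \
      --threads 8 \
      --cache-type-k q8_0 --cache-type-v q8_0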
CuriousSkeptic 4 days ago:
I'm sure this guy has some helpful hints on that: https://youtube.com/@azisk
sumo43 4 days ago:
Try running this using their harness: https://huggingface.co/flashresearch/FlashResearch-4B-Thinki...
aliljet 4 days ago:
Oh my god. 128GB of RAM! Way too late to repair this thread, but most people caught this.
sigmarule 4 days ago:
The Framework Desktop runs this perfectly well, and for just about $2k.
3abiton 4 days ago:
As many have pointed out, Macs are decent enough to run them (with maxed-out RAM). You also have more alternatives, like the DGX Spark (if you appreciate the ease of CUDA, albeit with a tad slower token generation) or the Strix Halo (good luck with ROCm though, AMD is still peddling hype). There is no straightforward "cheap" answer. You either go big (GPU server) or compromise. Either way, use vLLM, SGLang, or llama.cpp; Ollama is just inferior in every way to llama.cpp.