erelong 5 hours ago

What kind of hardware does HN recommend or like to run these models?

suprjami 5 hours ago | parent | next [-]

The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.
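Back-of-envelope math on why that fits (my own rough numbers, not exact figures for any specific GGUF): a Q4 quant averages roughly 4.5 bits per weight, plus a couple of GB for KV cache and buffers.

```python
# Rough VRAM estimate for a Q4 quant.
# The ~4.5 bits/weight figure and 2 GB overhead are assumptions,
# not exact numbers for any particular GGUF file.
def q4_vram_gb(params_billions, bits_per_weight=4.5, overhead_gb=2.0):
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb  # overhead: KV cache, buffers, etc.

print(f"27B @ Q4: ~{q4_vram_gb(27):.1f} GB")  # fits in 2x12 GB with room for context
print(f"35B @ Q4: ~{q4_vram_gb(35):.1f} GB")  # tight; a smaller context window helps
```

Both land under the 24 GB you get from two 3060s, which is why the context window ends up being the limiting factor.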

If you want to spend twice as much for more speed, get a 3090/4090/5090.

If you want long context, get two of them.

If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.

barrkel 5 hours ago | parent [-]

RTX 6000 Pro Blackwell, not Ada, for 96 GB.

dajonker 5 hours ago | parent | prev | next [-]

Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.
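Power limiting an NVIDIA card is a one-liner with nvidia-smi. A sketch (the 400 W cap below is purely illustrative, not a recommendation; check your card's supported range first):

```shell
# Query the current and supported power limits
nvidia-smi -q -d POWER

# Cap board power at an illustrative 400 W (resets on reboot)
sudo nvidia-smi -pl 400
```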

cyberax 2 hours ago | parent [-]

I have a pair of Radeon AI PRO R9700s with 32 GB, and so far they have been a pleasure to use. Drivers work out of the box, and they are completely quiet when idle. They are capped at 300 W, so even at 100% utilization they are not too loud.

I was thinking about adding after-market liquid cooling for them, but they're fine without it.

andsoitis 5 hours ago | parent | prev | next [-]

For fast inference, you’d be hard pressed to beat an Nvidia RTX 5090 GPU.

Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...

laweijfmvo 5 hours ago | parent [-]

I never would have guessed that in 2026, data centers would be measured in Watts and desktop PCs measured in liters.

andsoitis 4 hours ago | parent [-]

The Omen was neigh.

zozbot234 5 hours ago | parent | prev | next [-]

It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

xienze 5 hours ago | parent | prev | next [-]

It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I could probably do 256K, since I still have around 4GB of VRAM free). Prompt processing runs at something like 1K tokens/second, and generation is around 100 tokens/second. Plenty fast for agentic use via Opencode.
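For reference, a setup along these lines can be launched with llama.cpp's llama-server. A sketch (the model filename is a placeholder for whichever Q4 GGUF you download; check `llama-server --help` for the flags your build supports):

```shell
# -c sets the context window (131072 = 128K tokens)
# -ngl 99 offloads all layers to the GPU
llama-server -m Qwen3.5-35B-A3B-Q4.gguf -c 131072 -ngl 99 --port 8080
```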

rahimnathwani 5 hours ago | parent | next [-]

There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom

I'm curious which one you're using.

suprjami 5 hours ago | parent [-]

Unsloth Dynamic. Don't bother with anything else.

rahimnathwani 5 hours ago | parent [-]

UD-Q4_K_XL?

msuniverse2026 5 hours ago | parent | prev [-]

I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

pja 4 hours ago | parent | next [-]

> I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?

Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulkan.

Vulkan is easier to get going using the Mesa OSS drivers under Linux; HIP might give you slightly better performance.
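Building llama.cpp for either backend is a cmake flag away. A sketch (flag names are from recent llama.cpp; older builds used different names like LLAMA_HIPBLAS, and you'll need the Vulkan SDK or Mesa Vulkan drivers installed):

```shell
# Vulkan backend
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Or the HIP/ROCm backend instead
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j
```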

wirybeige 5 hours ago | parent | prev [-]

The Vulkan backend for llama.cpp isn't that far behind ROCm for prompt processing and token generation speeds.

elorant 5 hours ago | parent | prev | next [-]

Macs or a Strix Halo. Unless you want to go lower than 8-bit quantization, where any GPU with 24 GB of VRAM would probably run it.

CamperBob2 5 hours ago | parent | prev [-]

I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.
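Rough weight-only arithmetic backing that up (my own numbers; this ignores KV cache and runtime overhead, so treat it as a lower bound):

```python
# Weight-only size estimate; ignores KV cache and runtime overhead
def weights_gb(params_b, bits_per_weight):
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(f"27B dense @ BF16: ~{weights_gb(27, 16):.0f} GB")   # fits in 96 GB with headroom
print(f"122B MoE @ ~Q4:   ~{weights_gb(122, 4.5):.0f} GB") # also fits; 6-bit gets tight
```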

I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.

Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

MarsIronPI 5 hours ago | parent [-]

I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)

CamperBob2 3 hours ago | parent [-]

No experience with 5 and not much with 4.7, but they both have quite a few advocates over on /r/localllama.

Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.