erelong 5 hours ago:
What kind of hardware does HN recommend or like to run these models?
suprjami 5 hours ago:
The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window. If you want to spend twice as much for more speed, get a 3090/4090/5090. If you want long context, get two of them. If you have enough spare cash to buy a car, get an RTX 6000 Pro with 96G VRAM.
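As a rough sanity check on those fits, here's a hedged back-of-envelope sketch. The ~4.5 bits-per-weight figure for Q4_K-style GGUF quants and the 2 GB reserve for context/runtime buffers are assumptions, not measured numbers:

```python
# Back-of-envelope VRAM check (all figures approximate).
# Assumption: Q4_K-style quants average roughly 4.5 bits per weight;
# reserve ~2 GB for KV cache and runtime buffers.

def q4_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate VRAM footprint of quantized weights, in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for params in (27, 35):
    size = q4_weight_gb(params)
    fits = size + 2 < 24  # two 12G cards pooled, minus ~2 GB reserve
    print(f"{params}B @ Q4 ~= {size:.1f} GB, fits across 2x3060? {fits}")
```

Both come in under 24 GB by this estimate, which matches the comment; how much context you can then hold depends on the model's KV-cache shape.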

dajonker 5 hours ago:
Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM, and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise, if you have the money to burn, get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

andsoitis 5 hours ago:
For fast inference, you'd be hard pressed to beat an Nvidia RTX 5090 GPU. Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...

zozbot234 5 hours ago:
It depends. How long are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?
xienze 5 hours ago:
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I could probably do 256K, since I still have about 4GB of VRAM free). Prompt processing runs at something like 1K tokens/second, and generation at around 100 tokens/second. Plenty fast for agentic use via Opencode.
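The KV cache is usually what eats the remaining VRAM at long context. A hedged sizing sketch follows; the layer/head numbers are hypothetical stand-ins, not the actual 35B-A3B config:

```python
# Rough KV-cache sizing. The 48-layer / 4-KV-head / 128-dim shape below
# is illustrative only, not the real model's architecture.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elt: float) -> float:
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / 1e9

# 128K context:
print(kv_cache_gb(48, 4, 128, 128 * 1024, 2.0))  # fp16 K/V cache
print(kv_cache_gb(48, 4, 128, 128 * 1024, 0.5))  # ~4-bit quantized K/V cache
```

With grouped-query attention and a quantized cache (llama.cpp exposes this via `--cache-type-k`/`--cache-type-v`), 128K of context can plausibly land in the low single-digit GB range, which is consistent with the fit described above.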

elorant 5 hours ago:
Macs or a Strix Halo. Unless you're willing to go below 8-bit quantization, in which case any GPU with 24 GB of VRAM would probably run it.
CamperBob2 5 hours ago:
I think the 27B dense model at full precision and the 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it. I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models. Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
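The weight-size arithmetic behind that pairing can be sketched quickly. This ignores KV cache and runtime overhead, which add a few GB on top:

```python
# Pure weight-size arithmetic (no KV cache or runtime buffers included).

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(27, 16))  # dense 27B at fp16/bf16
print(weight_gb(122, 4))  # 122B MoE at 4-bit
print(weight_gb(122, 6))  # 122B MoE at 6-bit -- tight but under 96 GB
```

By this estimate both the full-precision 27B (~54 GB) and the 6-bit 122B (~91.5 GB) slot under the 96 GB card's VRAM, with the 6-bit case leaving only a few GB of headroom for context.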