kombine 7 hours ago
What kind of hardware (preferably non-Apple) can run this model? What about 122B?
daemonologist 7 hours ago
The 3B active is small enough that it's decently fast even with the experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX), but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters, offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32 GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.
ru552 7 hours ago
You won't like it, but the answer is Apple. The reason is the unified memory: the GPU can access all 32 GB, 64 GB, 128 GB, 256 GB, etc. of RAM. An easy way (napkin math) to know if you can run a model based on its parameter count is to treat the parameter count as the number of GB that need to fit in GPU RAM - a 35B model needs at least 35 GB of GPU RAM. This is a very simplified way of looking at it, and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.
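The napkin math above can be sketched as a tiny calculation (the 1 GB per billion parameters rule corresponds to roughly 8-bit weights; the per-quantization byte counts here are rough rules of thumb, not exact figures):

```python
def napkin_vram_gb(params_b: float, bytes_per_param: float = 1.0) -> float:
    """Rule of thumb from the comment above: ~1 GB of VRAM per billion
    parameters (roughly 8-bit weights). FP16 doubles that (~2 bytes/param);
    Q4 quantization roughly halves it (~0.5 bytes/param). KV cache for
    long contexts consumes additional memory on top of this."""
    return params_b * bytes_per_param

print(napkin_vram_gb(35))         # 35B at 8-bit: ~35 GB, matching the rule
print(napkin_vram_gb(35, 0.5))    # 35B at Q4: ~17.5 GB
print(napkin_vram_gb(122, 0.5))   # 122B at Q4: ~61 GB
```

This lines up with the file sizes quoted elsewhere in the thread (Q4 at ~22 GB for the 35B model, once real-world overhead is added).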
| |||||||||||||||||||||||
terramex 7 hours ago
I run Gemma 4 26B-A4B with 256k context (the maximum) on a Radeon 9070XT with 16 GB VRAM + 64 GB RAM with partial GPU offload (using the recommended LM Studio settings) at a very reasonable 35 tokens per second. This model is similar in size, so I expect similar performance.
mildred593 7 hours ago
I can run this on an AMD Framework laptop: a Ryzen 7 (I don't have Ryzen AI, just a Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough; I get at least 26 GB of VRAM. Fedora 43 and LM Studio with the Vulkan llama.cpp backend.
rhdunn 7 hours ago
The Q5 quantization (26.6 GB) should easily run on a 32 GB 5090. The Q4 (22.4 GB) should fit on a 24 GB 4090, but you may need to drop down to Q3 (16.8 GB) once you factor in the context. You can also run those on smaller cards by configuring the number of layers offloaded to the GPU - that should allow you to run the Q4/Q5 version on a 4090, or on older cards. You could also run it entirely on the CPU/in RAM if you have 32 GB (or ideally 64 GB) of RAM. The more you run in RAM, the slower the inference.
| |||||||||||||||||||||||
canpan 7 hours ago
Any good gaming PC can run the 35B-A3B model: llama.cpp with RAM offloading. A high-end gaming PC can run it at higher speeds. For the 122B, you need a lot of memory, which is expensive now, and it will be much slower since you'd be running mostly from system RAM.
| |||||||||||||||||||||||
bildung 6 hours ago
I currently run the qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models: I get ~20 t/s with an empty context and am down to about 14 t/s with 100k of context filled. No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.