| ▲ | Aurornis 3 hours ago |
| The 27B model they release directly would require significant hardware to run natively at 16-bit: A Mac or Strix Halo 128GB system, multiple high memory consumer GPUs, or an RTX 6000 workstation card. This is why they don’t advertise which consumer hardware it can run on: Their direct release that delivers these results cannot fit on your average consumer system. Most consumers don’t run the model they release directly. They run a quantized model that uses a lower number of bits per weight. The quantizations come with tradeoffs. You will not get the exact results they advertise using a quantized version, but you can fit it on smaller hardware. The previous 27B Qwen3.5 model had reasonable performance down to Q5 or Q4 depending on your threshold for quality loss. This was usable on a unified memory system (Mac, Strix Halo) with 32GB of extra RAM, so generally a 64GB Mac. They could also be run on an nVidia 5090 with 32GB RAM or a pair of 16GB or 24GB GPUs, which would not run as fast due to the split. Watch out for some of the claims about running these models on iPhones or smaller systems. You can use a lot of tricks and heavy quantization to run it on very small systems but the quality of output will not be usable. There is a trend of posting “I ran this model and this small hardware” repos for social media bragging rights but the output isn’t actually good. |
|
| ▲ | ryandrake 2 hours ago | parent | next [-] |
| Yea, this is currently the confusing part of running local models for newbies: Even after you have decided which model you want to run, and which org's quantizations to use (let's just assume Unsloth's for example), there are often dozens of quantizations offered, and choosing among them is confusing. Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then... Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat. Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering! 1: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF |
| |
| ▲ | danielhanchen 2 hours ago | parent | next [-] | | We made Unsloth Studio which should help :) 1. Auto best official parameters set for all models 2. Auto determines the largest quant that can fit on your PC / Mac etc 3. Auto determines max context length 4. Auto heals tool calls, provides python & bash + web search :) | | |
| ▲ | cyanydeez 2 minutes ago | parent | next [-] | | Is unsloth working on managing remote servers, like how vscode integrates with a remote server via ssh? | |
| ▲ | ryandrake an hour ago | parent | prev | next [-] | | Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it is launches, even after I delete the model file. I probably have to just sit down and do some Googleing/searching in order to rein the software in and get it to work the way I want it to on my system. | |
| ▲ | hypercube33 23 minutes ago | parent | prev | next [-] | | Sadly doesn't support fine tuning on AMD yet which gave me a sad since I wanted to cut one of these down to be specific domain experts. Also running the studio is a bit of a nightmare when it calls diskpart during its install (why?) | |
| ▲ | jbellis 14 minutes ago | parent | prev | next [-] | | what are you using for web search? | |
| ▲ | wuschel an hour ago | parent | prev [-] | | Great project! Thank you for that! |
| |
| ▲ | Aurornis 2 hours ago | parent | prev [-] | | > Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL There are actually two problems with this: First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions. Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM. The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context. | | |
| ▲ | smallerize 6 minutes ago | parent | next [-] | | I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant. | |
| ▲ | zargon 14 minutes ago | parent | prev | next [-] | | Qwen3.5 series is a little bit of an exception to the general rule here. It is incredibly kv cache size efficient. I think the max context fits in 3GB at q8 iirc. I prefer to keep the cache at full precision though. | |
| ▲ | ryandrake an hour ago | parent | prev [-] | | Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs. | | |
| ▲ | zozbot234 an hour ago | parent | next [-] | | That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput. | |
| ▲ | 28 minutes ago | parent | prev | next [-] | | [deleted] | |
| ▲ | khimaros 32 minutes ago | parent | prev [-] | | Strix Halo is another option |
|
|
|
|
| ▲ | ndriscoll 2 hours ago | parent | prev | next [-] |
| Note that you could also run them on AMD (and presumably Intel) dGPUs. e.g. I have a 32GB R9700, which is much cheaper than a 5090, and runs 27B dense models at ~20 t/s (or MoE models with 3-4B active at ~80t/s). I expect an Arc B70 would also work soon if it doesn't already, and would likely be the price/perf sweet spot right now. My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now. [0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume software bug where it's just not obeying the fan curve |
| |
| ▲ | acrispino 12 minutes ago | parent | next [-] | | I have 2x asrock R9700. One of the them was noticeably noisier than the other and eventually developed an annoying vibration while in the middle of its fan curve. Asrock replaced it under RMA. | |
| ▲ | zozbot234 2 hours ago | parent | prev [-] | | Yup, I suppose that these smaller, dense models are in the lead wrt. fast inference with consumer dGPUs (or iGPUs depending on total RAM) with just enough VRAM to contain the full model and context. That won't give you anywhere near SOTA results compared to larger MoE models with a similar amount of active parameters, but it will be quite fast. |
|
|
| ▲ | alex7o 17 minutes ago | parent | prev | next [-] |
| Because when you pay for a subscription they don't silently quantize the model a few week after release, and you can no longer get the full model running. Otherwise no need for full fp16, int8 works 99% as well for half the mem, and the lower you go the more you start to pay for the quants. But int8 is super safe imo. |
|
| ▲ | muyuu 2 hours ago | parent | prev | next [-] |
| i have a Strix Halo machine typically those dense models are too slow on Strix Halo to be practical, expect 5-7 tps you can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - i'd expect this model to be tested here soon, i don't think i will personally bother |
| |
| ▲ | hedgehog 2 hours ago | parent [-] | | This one is around 250 t/s prefill and 12.4 generation in my testing. |
|
|
| ▲ | wuschel an hour ago | parent | prev | next [-] |
| > but the quality of output will not be usable Making the the right pick for model is one of the key problems as a local user. Do you have any references where one can see a mapping of problem query to model response quality? |
|
| ▲ | Oras 2 hours ago | parent | prev | next [-] |
| If these models reach quality of Opus 4.5, then DGX could be a good alternative for serious dev teams to run local models. It is not that expensive and has short time to make ROI |
|
| ▲ | anonym29 an hour ago | parent | prev [-] |
| You absolutely do not need to run at full BF16. The quality loss between BF16 (55.65 GB in GGUF) and Q8_0 (30.44 GB in GGUF) is essentially zero - think on the order of magnitude of +0.01-0.03 perplexity, or ~0.1-0.3% relative PPL increase. The quality loss between BF16 and Q4_K_M (18.66 GB in GGUF) is close to imperceptible, with perplexity changes in the +0.1-0.3 ballpark, or ~1-3% relative PPL increase. This would correlate to a 0-2% drop on downstream tasks like MMLU/GSM8K/HellaSwag: essentially indistinguishable. You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) are larger but sparse MoE models - think Qwen 3.5 122B A10B - which offers the total knowledge (and memory requirements) of a 122B model, with processing and generation speed more akin to a 10B dense model, which is a big deal with the limited MBW we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head). I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically. |
| |
| ▲ | zozbot234 an hour ago | parent | next [-] | | I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference. | | |
| ▲ | anonym29 an hour ago | parent [-] | | I use llama.cpp, and there is a way to do this - some layers to (i)GPU, the rest to CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it out with Kimi K2.6 the other night. Check out the --n-cpu-moe flag in llama.cpp. That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo that had poor sustained random read, being DRAM-less. My effective read speeds from disk were averaging around 1.6-2.0 GB/s, and with unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, with ~64 GB worth of layers in iGPU and the rest free for KV cache + checkpoints. Even still, that was over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 PP tok/s and 0.22 TG tok/s. I could go a little faster with a better drive, or a little faster still if I dropping in two of em in raid0, but it would still be on the order of magnitude of sub-1 tok/s PP (compute limited) and TG (bandwidth limited). | | |
| ▲ | adrian_b 6 minutes ago | parent [-] | | In a computer with 2 PCIe 5.0 SSDs or one with a PCIe 5.0 SSDs and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s, or even more. This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable. Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots. I have actually started working at optimizing such an inference system, so your data is helpful for comparison. |
|
| |
| ▲ | p_stuart82 34 minutes ago | parent | prev [-] | | tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA. | | |
| ▲ | anonym29 23 minutes ago | parent [-] | | Just wait a week for model bugs to be worked out. This is well-known advice and a common practice within r/localllama. The flags are not hard at all if you're using llama.cpp regularly. If you're new to the ecosystem, that's closer to a one-time effort with irregular updates than it is to something you have to re-learn for every model. |
|
|