| ▲ | Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model(qwen.ai) |
| 177 points by mfiguiere 3 hours ago | 84 comments |
| |
|
| ▲ | jameson 18 minutes ago | parent | next [-] |
| What competitive advantage does OpenAI/Anthropic has when companies like Qwen/Minimax/etc are open sourcing models that shows similar (yet below than OpenAI/Anthropic) benchmark results? Also, the token prices of these open source models are at a fraction of Anthropic's Opus 4.6[1] [1]: https://artificialanalysis.ai/models/#pricing |
| |
| ▲ | fnordpiglet 7 minutes ago | parent | next [-] | | For coding often quality at the margin is crucial even at a premium. It’s not the same as cranking out spam emails or HN posts at scale. This is why the marginal difference between your median engineer and your P99 engineer is comp is substantial, while the marginal comp difference between your median pick and packer vs your P99 pick and packer isn’t. I’d also say it keeps the frontier shops competitive while costing R&D in the present is beneficial to them in forcing them to make a better and better product especially in value add space. Finally, particularly for Anthropic, they are going for the more trustworthy shop. Even ali is hosting pay frontier models for service revenue, but if you’re not a Chinese shop, would you really host your production code development workload on a Chinese hosted provider? OpenAI is sketchy enough but even there I have a marginal confidence they aren’t just wholesale mining data for trade secrets - even if they are using it for model training. Anthropic I slightly trust more. Hence the premium. No one really believes at face value a Chinese hosted firm isn’t mass trolling every competitive advantage possible and handing back to the government and other cross competitive firms - even if they aren’t the historical precedent is so well established and known that everyone prices it in. | |
| ▲ | Aurornis 16 minutes ago | parent | prev | next [-] | | I use Opus and the Qwen models. The gap between them is much larger than the benchmark charts show. If you want to compare to a hosted model, look toward the GLM hosted model. It’s closest to the big players right now. They were selling it at very low prices but have started raising the price recently. | |
| ▲ | Frannky 8 minutes ago | parent | prev [-] | | If these results are because of vampire attacks, the results will stop being so good when closed ones figure out how to pollute them when they are sucking answers. Also, they are not exactly as good when you use them in your daily flow; maybe for shallow reasoning but not for coding and more difficult stuff. Or at least I haven't found an open one as good as closed ones; I would love to, if you have some cool settings, please share |
|
|
| ▲ | anonzzzies an hour ago | parent | prev | next [-] |
| I wish that all announcements of models would show what (consumer) hardware you can run this on today, costs and tok/s. |
| |
| ▲ | Aurornis an hour ago | parent | next [-] | | The 27B model they release directly would require significant hardware to run natively at 16-bit: A Mac or Strix Halo 128GB system, multiple high memory consumer GPUs, or an RTX 6000 workstation card. This is why they don’t advertise which consumer hardware it can run on: Their direct release that delivers these results cannot fit on your average consumer system. Most consumers don’t run the model they release directly. They run a quantized model that uses a lower number of bits per weight. The quantizations come with tradeoffs. You will not get the exact results they advertise using a quantized version, but you can fit it on smaller hardware. The previous 27B Qwen3.5 model had reasonable performance down to Q5 or Q4 depending on your threshold for quality loss. This was usable on a unified memory system (Mac, Strix Halo) with 32GB of extra RAM, so generally a 64GB Mac. They could also be run on an nVidia 5090 with 32GB RAM or a pair of 16GB or 24GB GPUs, which would not run as fast due to the split. Watch out for some of the claims about running these models on iPhones or smaller systems. You can use a lot of tricks and heavy quantization to run it on very small systems but the quality of output will not be usable. There is a trend of posting “I ran this model and this small hardware” repos for social media bragging rights but the output isn’t actually good. | | |
| ▲ | ryandrake 24 minutes ago | parent | next [-] | | Yea, this is currently the confusing part of running local models for newbies: Even after you have decided which model you want to run, and which org's quantizations to use (let's just assume Unsloth's for example), there are often dozens of quantizations offered, and choosing among them is confusing. Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then... Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat. Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering! 1: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF | | |
| ▲ | danielhanchen 19 minutes ago | parent | next [-] | | We made Unsloth Studio which should help :) 1. Auto best official parameters set for all models 2. Auto determines the largest quant that can fit on your PC / Mac etc 3. Auto determines max context length 4. Auto heals tool calls, provides python & bash + web search :) | |
| ▲ | Aurornis 13 minutes ago | parent | prev [-] | | > Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL There are actually two problems with this: First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions. Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of RAM. The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context. |
| |
| ▲ | ndriscoll 21 minutes ago | parent | prev | next [-] | | Note that you could also run them on AMD (and presumably Intel) dGPUs. e.g. I have a 32GB R9700, which is much cheaper than a 5090, and runs 27B dense models at ~20 t/s (or MoE models with 3-4B active at ~80t/s). I expect an Arc B70 would also work soon if it doesn't already, and would likely be the price/perf sweet spot right now. My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now. [0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume software bug where it's just not obeying the fan curve | | |
| ▲ | zozbot234 2 minutes ago | parent [-] | | Yup, I suppose that these smaller, dense models are in the lead wrt. fast inference with consumer dGPUs (or iGPUs depending on total RAM) with just enough VRAM to contain the full model and context. That won't give you anywhere near SOTA results compared to larger MoE models with a similar amount of active parameters, but it will be quite fast. |
| |
| ▲ | muyuu 32 minutes ago | parent | prev [-] | | i have a Strix Halo machine typically those dense models are too slow on Strix Halo to be practical, expect 5-7 tps you can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - i'd expect this model to be tested here soon, i don't think i will personally bother |
| |
| ▲ | bityard 16 minutes ago | parent | prev | next [-] | | There are infinite combinations of CPU/GPU capable of running LLMs locally. What most people do is buy the system they can afford and roughly meets their goals and then ball-park VRAM usage by looking at the model size and quantization. For more a detailed analysis, there are several online VRAM calculators. Here's one: https://smcleod.net/vram-estimator/ If you have a huggingface account, you can set your system configuration and then you get little icons next to each quant in the sidebar. (Green: will likely fit, Yellow: Tight fit, Red: will not fit) Further, t/s depends greatly on a lot of different factors, the best you might get is a guess based on context size. One thing about running local LLMs right now, is that there are tradeoffs literally everywhere and you have to choose what to optimize for down to the individual task. | |
| ▲ | benob an hour ago | parent | prev | next [-] | | I get ~5 tokens/s on an M4 with 32G of RAM, using: llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
--no-mmproj \
--fit on \
-np 1 \
-c 65536 \
--cache-ram 4096 -ctxcp 2 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}'
35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they fare respectively at 41 t/s and 97 t/s.I haven't tested the 27B model yet, but 35B-A3B often gets off rails after 15k-20k tokens of context. You can have it to do basic things reliably, but certainly not at the level of "frontier" models. | | | |
| ▲ | proxysna an hour ago | parent | prev | next [-] | | Qwen3.5-27B with a 4bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and some additional vllm flags, i am serving 10 developers at 20-25tok/sek, off-peak is around 40tok/sek. Developers are ok with that performance, but ofc they requested more GPU's for added throughput. | | |
| ▲ | tandr 13 minutes ago | parent [-] | | What would be these additional vllm flags, if you don't mind sharing? |
| |
| ▲ | UncleOxidant 43 minutes ago | parent | prev | next [-] | | For Qwen3.5-27b I'm getting in the 20 to 25 tok/sec range on a 128GB Strix Halo box (Framework Desktop). That's with the 8-bit quant. It's definitely usable, but sometimes you're waiting a bit, though I'm not finding it problematic for the most part. I can run the Qwen3-coder-next (80b MoE) at 36tok/sec - hoping they release a Qwen3.6-coder soon. | | |
| ▲ | bityard 5 minutes ago | parent [-] | | I have a Framework Desktop too and 20-25 t/s is a lot better than I was expecting for such a large dense model. I'll have to try it out tonight. Are you using llama.cpp? |
| |
| ▲ | random3 21 minutes ago | parent | prev | next [-] | | Check out https://www.canirun.ai/ (and https://news.ycombinator.com/item?id=47363754) | |
| ▲ | chrsw an hour ago | parent | prev | next [-] | | These might help if the provider doesn't offer the same details themselves. Of course, we have to wait for the newly released models to get added to these sites. https://llmfit.io/ https://modelfit.io/ | |
| ▲ | xngbuilds 41 minutes ago | parent | prev | next [-] | | For Apple Mac, there is https://omlx.ai/benchmarks | |
| ▲ | ekojs an hour ago | parent | prev | next [-] | | As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run this on a 3090/4090/5090. You can probably even go FP8 with 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding, you can get even faster (2-3x I'd say). Pretty amazing what you can get locally. | | |
| ▲ | Aurornis an hour ago | parent | next [-] | | > As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless The 4-bit quants are far from lossless. The effects show up more on longer context problems. > You can probably even go FP8 with 5090 (though there will be tradeoffs) You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers. | | |
| ▲ | ekojs 28 minutes ago | parent [-] | | > You cannot run these models at 8-bit on a 32GB card because you need space for context You probably can actually. Not saying that it would be ideal but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible. I addressed the lossless claim in another reply but I guess it really depends on what the model is used for. For my usecases, it's nearly lossless I'd say. |
| |
| ▲ | zozbot234 44 minutes ago | parent | prev | next [-] | | 4-bit quantization is almost never lossless especially for agentic work, it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters that's been quantized with more precision. | | |
| ▲ | ekojs 36 minutes ago | parent [-] | | Yeah, figure the 'nearly lossless' claim is the most controversial thing. But in my defense, ~97% recovery in benchmarks is what I consider 'nearly lossless'. When quantized with calibration data for a specialized domain, the difference in my internal benchmark is pretty much indistinguishable. But for agentic work, 4-bit quants can indeed fall a bit short in long-context usecase, especially if you quantize the attention layers. |
| |
| ▲ | binary132 an hour ago | parent | prev [-] | | That seems awfully speculative without at least some anecdata to back it up. | | |
| ▲ | arcanemachiner an hour ago | parent | next [-] | | Sure, go get some. This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time. Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it. | |
| ▲ | ekojs an hour ago | parent | prev [-] | | Not at all, I actually run ~30B dense models for production and have tested out 5090/3090 for that. There are gotchas of course, but the speed/quality claims should be roughly there. |
|
| |
| ▲ | rubiquity an hour ago | parent | prev | next [-] | | At 8-bit quantization (q8_0) I get 20 tokens per second on a Radeon R9700. | |
| ▲ | jjcm an hour ago | parent | prev | next [-] | | Fwiw, huggingface does this on the page where you download the weights. Slightly different format though - you put all the hardware you have, and it shows which quants you can run. | |
| ▲ | arcanemachiner an hour ago | parent | prev | next [-] | | Divide the value before the B by 2, and there's your answer if you get a Q4_K_M quant. Plus a bit of room for KV cache. TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant. Tokens per second is an unreasonable ask since every card is different, are you using GGUF or not, CUDA or ROCm or Vulkan or MLX, what optimizations are in your version of your inference software, flags are you running, etc. Note that it's a dense model (the Qwen models have another value at the end of the MoE model names, e.g. A3B) so it will not run very well in RAM, whereas with a MoE model, you can spill over into RAM if you don't have enough VRAM, and still have reasonable performance. Using these models requires some technical know-how, and there's no getting around that. | |
| ▲ | underlines an hour ago | parent | prev | next [-] | | depends on format, compute type, quantization and kv cache size. | | |
| ▲ | mottosso an hour ago | parent [-] | | Specs for whatever they used to achieve the benchmarks would be a good start. | | |
| ▲ | Aurornis an hour ago | parent [-] | | The benchmarks are from the unquantized model they release. This will only run on server hardware, some workstation GPUs, or some 128GB unified memory systems. It’s a situation where if you have to ask, you can’t run the exact model they released. You have to wait for quantizations to smaller sizes, which come in a lot of varieties and have quality tradeoffs. |
|
| |
| ▲ | jauntywundrkind 42 minutes ago | parent | prev [-] | | I would detest the time/words it takes to hand hold through such a review, of teaching folks the basics about LLM like this. It's also a section that, with hope, becomes obsolete sometime semi soon-ish. |
|
|
| ▲ | sietsietnoac an hour ago | parent | prev | next [-] |
| Generate an SVG of a pelican riding a bicycle:
https://codepen.io/chdskndyq11546/pen/yyaWGJx Generate an SVG of a dragon eating a hotdog while driving a car:
https://codepen.io/chdskndyq11546/pen/xbENmgK Far from perfect, but it really shows how powerful these models can get |
| |
| ▲ | yrds96 3 minutes ago | parent | next [-] | | I wonder if this became a so well known "benchmark" that models already got trained for it. | |
| ▲ | tln 7 minutes ago | parent | prev [-] | | The dragon image has issues like one eye, weird tail etc, but the pelican is imo perfect -- the best I've seen! |
|
|
| ▲ | vibe42 27 minutes ago | parent | prev | next [-] |
| Q4-Q5 quants of this model runs well on gaming laptops with 24GB VRAM and 64GB RAM. Can get one of those for around $3,500. Interesting pros/cons vs the new Macbook Pros depending on your prefs. And Linux runs better than ever on such machines. |
| |
| ▲ | kroaton 22 minutes ago | parent | next [-] | | A3B-35B is better suited for laptops with enough VRAM/RAM.
This dense model however will be bandwidth limited on most cards. The 5090RTX mobile sits at 896GB/s, as opposed to the 1.8TB/s of the 5090 desktop and most mobile chips have way smaller bandwith than that, so speeds won't be incredible across the board like with Desktop computers. | | |
| ▲ | jadbox 12 minutes ago | parent [-] | | I find A3B-35B as an ideal model for small local projects- definitely the best for me so far |
| |
| ▲ | doix 24 minutes ago | parent | prev [-] | | What laptop has that much VRAM and RAM for $3500 with good/okay-ish Linux support? I was looking to upgrade my asus zephyrus g14 from 2021 and things were looking very expensive. Decided to just keep it chugging along for another year. Then again, I was looking in the UK, maybe prices are extra inflated there. |
|
|
| ▲ | UncleOxidant an hour ago | parent | prev | next [-] |
| I've been waiting for this one. I've been using 3.5-27b with pretty good success for coding in C,C++ and Verilog. It's definitely helped in the light of less Claude availability on the Pro plan now. If their benchmarks are right then the improvement over 3.5 should mean I'm going to be using Claude even less. |
|
| ▲ | vladgur an hour ago | parent | prev | next [-] |
| This is getting very close to fit a single 3090 with 24gb VRAM :) |
| |
| ▲ | originalvichy an hour ago | parent | next [-] | | Yup! Smaller quants will fit within 24GB but they might sacrifice context length. I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already. | | |
| ▲ | donmcronald 29 minutes ago | parent | next [-] | | I have an Mini M4 Pro with 64GB of 273GB/s memory bandwidth and it's borderline with 3.5-27B. I assume this one is the same. I don't know a ton, but I think it's the memory bandwidth that limits it. It's similar on a DGX Spark I have access to (almost the same memory bandwidth). It's been a while since I tried it, but I think I was getting around 12-15 tokens per second an that feels slow when you're used to the big commercial models. Whenever I actually want to do stuff with the open source models, I always find myself falling back to OpenRouter. I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple days ago and that felt usable speed wise. I don't know about quality, but that's like running a 3B parameter model. 27B is a lot slower. I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable? | |
| ▲ | ycui1986 an hour ago | parent | prev [-] | | 32GB RAM on mac also need to host OS, software, and other stuff. There may not even be 24GB VRAM left for the model. |
| |
| ▲ | GaggiX an hour ago | parent | prev [-] | | At 4-bit quantization it should already fit quite nicely. | | |
| ▲ | Aurornis an hour ago | parent [-] | | Unfortunately not with a reasonable context length. | | |
| ▲ | kkzz99 6 minutes ago | parent [-] | | It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090. |
|
|
|
|
| ▲ | originalvichy an hour ago | parent | prev | next [-] |
| Good news! Friendly reminder: wait a couple weeks to judge the ”final” quality of these free models. Many of them suffer from hidden bugs when connected to an inference backend or bad configs that slow them down. The dev community usually takes a week or two to find the most glaring issues. Some of them may require patches to tools like llama.cpp, and some require users to avoid specific default options. Gemma 4 had some issues that were ironed out within a week or two. This model is likely no different. Take initial impressions with a grain of salt. |
| |
| ▲ | jjcm an hour ago | parent | next [-] | | This is probably less likely with this model, as it’s almost certainly a further RL training continuation of 3.5 27b. The bugs with this architecture were worked out when that dropped. | | | |
| ▲ | Aurornis an hour ago | parent | prev [-] | | Good advice for all new LLM experimenters. The bugs come from the downstream implementations and quantizations (which inherit bugs in the tools). Expect to update your tools and redownload the quants multiple times over 2-4 weeks. There is a mad rush to be first to release quants and first to submit PRs to the popular tools, but the output is often not tested much before uploading. If you experiment with these on launch week, you are the tester. :) |
|
|
| ▲ | amunozo 2 hours ago | parent | prev | next [-] |
| A bit skeptical about a 27B model comparable to opus... |
| |
| ▲ | originalvichy an hour ago | parent | next [-] | | For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for mediym-level models. Size != quality for specialized, narrow use cases such as coding. It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs. | | |
| ▲ | cbg0 20 minutes ago | parent [-] | | If all you're looking at is benchmarks that might be true, but those are way too easy to game. Try using this model alongside Opus for some work in Rust/C++ and it'll be night and day. You really can't compare a model that's got trillions of parameters to a 27B one. |
| |
| ▲ | rubiquity 41 minutes ago | parent | prev | next [-] | | You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun and barely notice a difference given that systems programming work needs careful guidance for any model currently. I don't try to one shot landing pages or whatever. | |
| ▲ | Aurornis an hour ago | parent | prev | next [-] | | You should be skeptical. Benchmark racing is the current meta game in open weight LLMs. Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it. Impressive for the size, though! | |
| ▲ | jjcm an hour ago | parent | prev | next [-] | | Opus 4.5 mind you, but I’m not too surprised given how good 3.5 was and how good the qwopus fine tune was. The model was shown to benefit heavily from further RL. | |
| ▲ | esafak an hour ago | parent | prev | next [-] | | Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to? | | |
| ▲ | underlines an hour ago | parent | next [-] | | well, your own, unleaked ones, representing your real workloads. if you can't afford to do that, look at a lot of them, eg. on artificialanalysis.com they merge multiple benchmarks across weighted categories and build an Intelligence Score, Coding Score and Agentic score. | |
| ▲ | WarmWash an hour ago | parent | prev [-] | | ARC-AGI 2 GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%. |
| |
| ▲ | cmrdporcupine 44 minutes ago | parent | prev | next [-] | | A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here. But when actually employed to write code they will fall over when they leave that specific domain. Basically they might have skill but lack wisdom. Certainly at this size they will lack anywhere close to the same contextual knowledge. Still these things could be useful in the context of more specialized tooling, or in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews results. | |
| ▲ | wesammikhail an hour ago | parent | prev [-] | | you'd be surprised how good small models have gotten. Size of the model isnt all that matters. | | |
| ▲ | freedomben an hour ago | parent | next [-] | | Plus you can control thinking time a lot more, so when Anthropic lobotomizes Opus on you... | |
| ▲ | dudefeliciano an hour ago | parent | prev | next [-] | | > Size of the model isnt all that matters. What matters is the motion in the tokens | |
| ▲ | verdverm an hour ago | parent | prev [-] | | My experience with qwen-3.6:35B-A3B reinforces this, gonna give this a spin when unsloth has quants available Gemini flash was just as good as pro for most tasks with good prompts, tools, and context. Gemma 4 was nearly as good as flash and Qwen 3.6 appears to be even better. | | |
|
|
|
| ▲ | pama an hour ago | parent | prev | next [-] |
| Has anyone tested it at home yet and wants to share early impressions? |
| |
| ▲ | lreeves an hour ago | parent [-] | | I have been kicking the tires for about 40 minutes since it downloaded and it seems excellent at general tasks, image comprehension and coding/tool-calling (using VLLM to serve it). I think it squeaks past Gemma4 but it's hard to tell yet. | | |
| ▲ | alfonsodev an hour ago | parent [-] | | good to hear! Do you mind sharing your setup and tokens / seconds performance ? | | |
| ▲ | lreeves 15 minutes ago | parent [-] | | I'm running the unquantized base model on 2xA6000s (Ampere gen, 48GB each). Runs at about 25 tokens/second. |
|
|
|
|
| ▲ | spwa4 an hour ago | parent | prev | next [-] |
| Unsloth quants available: https://unsloth.ai/docs/models/qwen3.6 |
| |
| ▲ | genpfault 40 minutes ago | parent | next [-] | | Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend: $ llama-server --version
version: 8851 (e365e658f)
$ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.529 | 654.11 | 3.470 | 36.89 | 4.999 | 225.67 |
| 2000 | 128 | 1 | 2128 | 3.064 | 652.75 | 3.498 | 36.59 | 6.562 | 324.30 |
| 4000 | 128 | 1 | 4128 | 6.180 | 647.29 | 3.535 | 36.21 | 9.715 | 424.92 |
| 8000 | 128 | 1 | 8128 | 12.477 | 641.16 | 3.582 | 35.73 | 16.059 | 506.12 |
| 16000 | 128 | 1 | 16128 | 25.849 | 618.98 | 3.667 | 34.91 | 29.516 | 546.42 |
| 32000 | 128 | 1 | 32128 | 57.201 | 559.43 | 3.825 | 33.47 | 61.026 | 526.47 |
| |
| ▲ | GrinningFool 15 minutes ago | parent | prev | next [-] | | 128GB (112 GB avail) Strix AI 395+ Radeon 8060x (gfx1151) llama-* version 8889 w/ rocm support ; nightly rocm llama.cpp/build/bin/llama-batched-bench --version unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000 | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 2.776 | 360.22 | 20.192 | 6.34 | 22.968 | 49.11 |
| 2000 | 128 | 1 | 2128 | 5.778 | 346.12 | 20.211 | 6.33 | 25.990 | 81.88 |
| 4000 | 128 | 1 | 4128 | 11.723 | 341.22 | 20.291 | 6.31 | 32.013 | 128.95 |
| 8000 | 128 | 1 | 8128 | 24.223 | 330.26 | 20.399 | 6.27 | 44.622 | 182.15 |
| 16000 | 128 | 1 | 16128 | 52.521 | 304.64 | 20.669 | 6.19 | 73.190 | 220.36 |
| 32000 | 128 | 1 | 32128 | 120.333 | 265.93 | 21.244 | 6.03 | 141.577 | 226.93 |
More directly comparable to the results posted by genpfault (IQ4_XS):llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000 | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 2.543 | 393.23 | 9.829 | 13.02 | 12.372 | 91.17 |
| 2000 | 128 | 1 | 2128 | 5.400 | 370.36 | 9.891 | 12.94 | 15.291 | 139.17 |
| 4000 | 128 | 1 | 4128 | 10.950 | 365.30 | 9.972 | 12.84 | 20.922 | 197.31 |
| 8000 | 128 | 1 | 8128 | 22.762 | 351.46 | 10.118 | 12.65 | 32.880 | 247.20 |
| 16000 | 128 | 1 | 16128 | 49.386 | 323.98 | 10.387 | 12.32 | 59.773 | 269.82 |
| 32000 | 128 | 1 | 32128 | 114.218 | 280.16 | 10.950 | 11.69 | 125.169 | 256.68 |
| |
| ▲ | endymi0n an hour ago | parent | prev [-] | | at this trajectory, unsloth are going to release the models BEFORE the model drop within the next weeks... | | |
|
|
| ▲ | Mr_Eri_Atlov 38 minutes ago | parent | prev [-] |
| Excited to try this, the Qwen 3.6 MoE they just released a week or so back had a noticeable performance bump from 3.5 in a rather short period of time. For anyone invested in running LLMs at home or on a much more modest budget rig for corporate purposes, Gemma 4 and Qwen 3.6 are some of the most promising models available. |