| ▲ | anonzzzies 3 hours ago |
| I wish that all announcements of models would show what (consumer) hardware you can run them on today, what it costs, and the tok/s you can expect. |
|
| ▲ | Aurornis 3 hours ago | parent | next [-] |
| The 27B model they release directly would require significant hardware to run natively at 16-bit: A Mac or Strix Halo 128GB system, multiple high memory consumer GPUs, or an RTX 6000 workstation card. This is why they don’t advertise which consumer hardware it can run on: Their direct release that delivers these results cannot fit on your average consumer system. Most consumers don’t run the model they release directly. They run a quantized model that uses a lower number of bits per weight. The quantizations come with tradeoffs. You will not get the exact results they advertise using a quantized version, but you can fit it on smaller hardware. The previous 27B Qwen3.5 model had reasonable performance down to Q5 or Q4 depending on your threshold for quality loss. This was usable on a unified memory system (Mac, Strix Halo) with 32GB of extra RAM, so generally a 64GB Mac. It could also be run on an Nvidia 5090 with 32GB of VRAM or a pair of 16GB or 24GB GPUs, which would not run as fast due to the split. Watch out for some of the claims about running these models on iPhones or smaller systems. You can use a lot of tricks and heavy quantization to run it on very small systems, but the quality of output will not be usable. There is a trend of posting “I ran this model on this small hardware” repos for social media bragging rights, but the output isn’t actually good. |
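A rough back-of-the-envelope for the weights alone (a sketch; real GGUF file sizes vary a little by quant layout, and the KV cache for your context window comes on top of this):

    27B params × 2 bytes/weight    (BF16)    ≈ 54 GB
    27B params × ~1.1 bytes/weight (Q8_0)    ≈ 30 GB
    27B params × ~0.6 bytes/weight (Q4_K_M)  ≈ 17 GB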
| |
| ▲ | ryandrake 2 hours ago | parent | next [-] | | Yea, this is currently the confusing part of running local models for newbies: Even after you have decided which model you want to run, and which org's quantizations to use (let's just assume Unsloth's for example), there are often dozens of quantizations offered, and choosing among them is confusing. Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL. Will they differ significantly? What are each of them good at? The 4-bit quantizations will be a "tight squeeze" on your 20GB GPU. Again, Unsloth steps up to the plate with seven(!!) choices: IQ4_XS, Q4_K_S, IQ4_NL, Q4_0, Q4_1, Q4_K_M, UD-Q4_K_XL. Holy shit where do I even begin? You can try each of them to see what fits on your GPU, but that's a lot of downloading, and then... Once you [guess and] commit to one of the quantizations and do a gigantic download, you're not done fiddling. You need to decide at the very least how big a context window you need, and this is going to be trial and error. Choose a value, try to load the model, if it fails, you chose too large. Rinse and repeat. Then finally, you're still not done. Don't forget the parameters: temperature, top_p, top_k, and so on. It's bewildering! 1: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF | | |
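A minimal sketch of that try-it-and-see loop with llama-server, using one of the quants named above (flag values are illustrative starting points, and the sampler settings echo the ones quoted elsewhere in this thread, not official recommendations):

    # Pick one quant and a context size; if the load fails, shrink -c and retry.
    llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q3_K_XL \
      -ngl 99 -c 32768 \
      --temp 0.6 --top-p 0.95 --top-k 20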
▲ | danielhanchen 2 hours ago | parent | next [-] | | We made Unsloth Studio, which should help :) 1. Auto-sets the best official parameters for all models 2. Auto-determines the largest quant that can fit on your PC / Mac etc. 3. Auto-determines max context length 4. Auto-heals tool calls, provides Python & Bash + web search :) | | |
| ▲ | cyanydeez 2 minutes ago | parent | next [-] | | Is unsloth working on managing remote servers, like how vscode integrates with a remote server via ssh? | |
▲ | ryandrake an hour ago | parent | prev | next [-] | | Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it launches, even after I delete the model file. I probably have to just sit down and do some Googling/searching in order to rein the software in and get it to work the way I want it to on my system. | |
▲ | hypercube33 22 minutes ago | parent | prev | next [-] | | Sadly it doesn't support fine-tuning on AMD yet, which gave me a sad, since I wanted to cut one of these down into specific domain experts. Also, running the studio is a bit of a nightmare when it calls diskpart during its install (why?) | |
| ▲ | jbellis 14 minutes ago | parent | prev | next [-] | | what are you using for web search? | |
| ▲ | wuschel an hour ago | parent | prev [-] | | Great project! Thank you for that! |
| |
▲ | Aurornis 2 hours ago | parent | prev [-] | | > Say you have a GPU with 20GB of VRAM. You're probably going to be able to run all the 3-bit quantizations with no problem, but which one do you choose? Unsloth offers[1] four of them: UD-IQ3_XXS, Q3_K_S, Q3_K_M, UD-Q3_K_XL There are actually two problems with this: First, the 3-bit quants are where the quality loss really becomes obvious. You can get it to run, but you’re not getting the quality you expected. The errors compound over longer sessions. Second, you need room for context. If you have become familiar with the long 200K contexts you get with SOTA models, you will not be happy with the minimal context you can fit into a card with 16-20GB of VRAM. The challenge for newbies is learning to identify the difference between being able to get a model to run, and being able to run it with useful quality and context. | | |
| ▲ | smallerize 6 minutes ago | parent | next [-] | | I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant. | |
▲ | zargon 13 minutes ago | parent | prev | next [-] | | The Qwen3.5 series is a little bit of an exception to the general rule here. It is incredibly KV-cache efficient; IIRC the max context fits in about 3GB at q8. I prefer to keep the cache at full precision though. | |
| ▲ | ryandrake an hour ago | parent | prev [-] | | Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs. | | |
| ▲ | zozbot234 an hour ago | parent | next [-] | | That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput. | |
| ▲ | khimaros 32 minutes ago | parent | prev [-] | | Strix Halo is another option |
|
|
| |
▲ | ndriscoll 2 hours ago | parent | prev | next [-] | | Note that you could also run them on AMD (and presumably Intel) dGPUs. E.g. I have a 32GB R9700, which is much cheaper than a 5090, and runs 27B dense models at ~20 t/s (or MoE models with 3-4B active at ~80 t/s). I expect an Arc B70 would also work soon if it doesn't already, and would likely be the price/perf sweet spot right now. My R9700 does seem to have an annoying firmware or driver bug[0] that causes the fan to usually be spinning at 100% regardless of temperature, which is very noisy and wastes like 20+ W, but I just moved my main desktop to my basement and use an almost silent N150 minipc as my daily driver now. [0] Or manufacturing defect? I haven't seen anyone discussing it online, but I don't know how many owners are out there. It's a Sapphire fwiw. It does sometimes spin down, the reported temperatures are fine, and IIRC it reports the fan speed as maxed out, so I assume it's a software bug where it's just not obeying the fan curve. | | |
▲ | acrispino 11 minutes ago | parent | next [-] | | I have 2x ASRock R9700. One of them was noticeably noisier than the other and eventually developed an annoying vibration while in the middle of its fan curve. ASRock replaced it under RMA. | |
| ▲ | zozbot234 2 hours ago | parent | prev [-] | | Yup, I suppose that these smaller, dense models are in the lead wrt. fast inference with consumer dGPUs (or iGPUs depending on total RAM) with just enough VRAM to contain the full model and context. That won't give you anywhere near SOTA results compared to larger MoE models with a similar amount of active parameters, but it will be quite fast. |
| |
▲ | alex7o 16 minutes ago | parent | prev | next [-] | | Because when you pay for a subscription they don't silently quantize the model a few weeks after release, and you can no longer get the full model running. Otherwise there's no need for full fp16; int8 works 99% as well at half the memory, and the lower you go, the more you start to pay for the quants. But int8 is super safe imo. | |
▲ | muyuu 2 hours ago | parent | prev | next [-] | | I have a Strix Halo machine. Typically those dense models are too slow on Strix Halo to be practical; expect 5-7 tps. You can get an idea by looking at other dense benchmarks here: https://strixhalo.zurkowski.net/experiments - I'd expect this model to be tested there soon; I don't think I will personally bother. | | |
▲ | hedgehog 2 hours ago | parent [-] | | This one is around 250 t/s prefill and 12.4 t/s generation in my testing. |
| |
▲ | wuschel an hour ago | parent | prev | next [-] | | > but the quality of output will not be usable Making the right pick of model is one of the key problems as a local user. Do you have any references where one can see a mapping of problem/query to model response quality? | |
▲ | Oras 2 hours ago | parent | prev | next [-] | | If these models reach the quality of Opus 4.5, then a DGX could be a good alternative for serious dev teams to run local models. It is not that expensive, and the time to ROI would be short. | |
▲ | anonym29 an hour ago | parent | prev [-] | | You absolutely do not need to run at full BF16. The quality loss between BF16 (55.65 GB in GGUF) and Q8_0 (30.44 GB in GGUF) is essentially zero - think on the order of magnitude of +0.01-0.03 perplexity, or ~0.1-0.3% relative PPL increase. The quality loss between BF16 and Q4_K_M (18.66 GB in GGUF) is close to imperceptible, with perplexity changes in the +0.1-0.3 ballpark, or ~1-3% relative PPL increase. This would correlate to a 0-2% drop on downstream tasks like MMLU/GSM8K/HellaSwag: essentially indistinguishable. You absolutely do NOT need a $3000 Strix Halo rig or a $4000 Mac or a $9000 RTX 6000 or "multiple high memory consumer GPUs" to run this model at extremely high accuracy. I say this as a huge Strix Halo fanboy (Beelink GTR 9 Pro), mind you. Where Strix Halo is more necessary (and actually offers much better performance) is with larger but sparse MoE models - think Qwen 3.5 122B A10B - which offer the total knowledge (and memory requirements) of a 122B model, with processing and generation speed more akin to a 10B dense model, which is a big deal with the limited MBW we get in the land of Strix Halo (256 GB/s theoretical, ~220 GB/s real-world) and DGX Spark (273 GB/s theoretical - not familiar with real-world numbers specifically off the top of my head). I would make the argument, as a Strix Halo owner, that 27B dense models are actually not particularly pleasant or snappy to run on Strix Halo, and you're much better off with those larger but sparse MoE models with far fewer active parameters on such systems. I'd much rather have an RTX 5090, an Arc B70 Pro, or an AMD AI PRO R9700 (dGPUs with 32GB of GDDR6/7) for 27B dense models specifically. | | |
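For anyone who would rather measure those perplexity deltas than take them on faith, llama.cpp ships a perplexity tool; a rough sketch (file names are placeholders, and the text file should resemble your actual workload):

    # Run the same held-out text through two quants and compare the reported PPL.
    llama-perplexity -m Qwen3.6-27B-Q8_0.gguf   -f held-out.txt -ngl 99
    llama-perplexity -m Qwen3.6-27B-Q4_K_M.gguf -f held-out.txt -ngl 99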
| ▲ | zozbot234 an hour ago | parent | next [-] | | I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference. | | |
▲ | anonym29 an hour ago | parent [-] | | I use llama.cpp, and there is a way to do this - some layers to the (i)GPU, the rest to CPU. I was just trying this out with Kimi K2.5 (in preparation for trying out Kimi K2.6) the other night. Check out the --n-cpu-moe flag in llama.cpp. That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo that has poor sustained random reads, being DRAM-less. My effective read speeds from disk averaged around 1.6-2.0 GB/s, and with Unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, with ~64 GB worth of layers on the iGPU and the rest of the memory free for KV cache + checkpoints, that still left over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 PP tok/s and 0.22 TG tok/s. I could go a little faster with a better drive, or a little faster still if I dropped two of them in RAID 0, but it would still be on the order of magnitude of sub-1 tok/s PP (compute limited) and TG (bandwidth limited). | |
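A minimal sketch of that kind of split, assuming a recent llama.cpp build (the model path and layer count are placeholders; raise --n-cpu-moe until the GPU-resident part plus KV cache fits in your VRAM or unified memory):

    # Attention and shared weights go to the GPU (-ngl 99); the expert weights of
    # the first N layers stay on the CPU side (in system RAM, or mmap'd from disk
    # if the model is larger than RAM).
    llama-server -m ./kimi-k2.5-iq2_xxs.gguf \
      -ngl 99 --n-cpu-moe 50 -c 16384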
▲ | adrian_b 6 minutes ago | parent [-] | | In a computer with 2 PCIe 5.0 SSDs, or one with a PCIe 5.0 SSD and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s, or even more. This is not a little faster, but 10 times faster than on your system. So a couple of tokens per second generation speed should be achievable. Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots. I have actually started working on optimizing such an inference system, so your data is helpful for comparison. |
|
| |
| ▲ | p_stuart82 34 minutes ago | parent | prev [-] | | tbh ~1-3% PPL hit from Q4_K_M stopped being the bottleneck a while ago. the bottleneck is the 48 hours of guessing llama.cpp flags and chat template bugs before the ecosystem catches up. you are doing unpaid QA. | | |
| ▲ | anonym29 22 minutes ago | parent [-] | | Just wait a week for model bugs to be worked out. This is well-known advice and a common practice within r/localllama. The flags are not hard at all if you're using llama.cpp regularly. If you're new to the ecosystem, that's closer to a one-time effort with irregular updates than it is to something you have to re-learn for every model. |
|
|
|
|
| ▲ | benob 3 hours ago | parent | prev | next [-] |
| I get ~5 tokens/s on an M4 with 32G of RAM, using: llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
--no-mmproj \
--fit on \
-np 1 \
-c 65536 \
--cache-ram 4096 -ctxcp 2 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}'
| The 35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they fare respectively at 41 t/s and 97 t/s. I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models. |
| |
| ▲ | danielhanchen 2 hours ago | parent | next [-] | | We also made some dynamic MLX ones if they help - it might be faster for Macs, but llama-server definitely is improving at a fast pace. https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-4bit | | | |
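For anyone who hasn't touched the MLX side before, a minimal sketch with the mlx-lm package (assuming a current mlx-lm install; the prompt is just an example):

    pip install mlx-lm
    mlx_lm.generate --model unsloth/Qwen3.6-27B-UD-MLX-4bit \
      --prompt "Summarize the tradeoffs of 4-bit quantization." \
      --max-tokens 256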
| ▲ | dunb 2 hours ago | parent | prev | next [-] | | Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results. | |
| ▲ | wuschel 42 minutes ago | parent | prev | next [-] | | How is the quality of model answers to your queries? Are they stable over time? I am wondering how to measure that anyway. | |
| ▲ | kpw94 an hour ago | parent | prev [-] | | When you say tok/s here are you describing the prefill (prompt eval) token/s or the output generation tok/s? (Btw I believe the "--jinja" flag is by default true since sometime late 2025, so not needed anymore) | | |
| ▲ | zargon an hour ago | parent [-] | | If someone doesn't specifically say prefill then they always mean decode speed. I have never seen an exception. Most people just ignore prefill. | | |
▲ | kpw94 32 minutes ago | parent [-] | | But isn't the prefill speed the bottleneck in some systems*? Sure, it's an order of magnitude faster (10x on Apple Metal?), but there's also an order of magnitude more tokens to process, especially for tasks involving summarization of some sort. But point taken that the parent numbers are probably decode. * Specifically, Mac Metal, which is what the parent numbers are about | |
| ▲ | zargon a minute ago | parent [-] | | Yes, definitely it's the bottleneck for most use cases besides "chatting". It's the reason I have never bought a Mac for LLM purposes. It's frustrating when trying to find benchmarks because almost everyone gives decode speed without mentioning prefill speed. |
|
|
|
|
|
| ▲ | bityard 2 hours ago | parent | prev | next [-] |
| There are infinite combinations of CPU/GPU capable of running LLMs locally. What most people do is buy the system they can afford that roughly meets their goals and then ball-park VRAM usage by looking at the model size and quantization. For a more detailed analysis, there are several online VRAM calculators. Here's one: https://smcleod.net/vram-estimator/ If you have a huggingface account, you can set your system configuration and then you get little icons next to each quant in the sidebar. (Green: will likely fit, Yellow: tight fit, Red: will not fit) Further, t/s depends greatly on a lot of different factors; the best you might get is a guess based on context size. One thing about running local LLMs right now is that there are tradeoffs literally everywhere and you have to choose what to optimize for down to the individual task. |
|
| ▲ | proxysna 3 hours ago | parent | prev | next [-] |
| Qwen3.5-27B with a 4-bit quant can be run on a 24GB card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/sec; off-peak is around 40 tok/sec. Developers are OK with that performance, but of course they requested more GPUs for added throughput. |
| |
| ▲ | tandr 2 hours ago | parent | next [-] | | What would be these additional vllm flags, if you don't mind sharing? | |
| ▲ | PcChip an hour ago | parent | prev [-] | | question: why not use something like Claude?
is it for security reasons? | | |
| ▲ | lambda 14 minutes ago | parent [-] | | Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc. I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways? |
|
|
|
| ▲ | UncleOxidant 2 hours ago | parent | prev | next [-] |
| For Qwen3.5-27B I'm getting in the 20 to 25 tok/sec range on a 128GB Strix Halo box (Framework Desktop). That's with the 8-bit quant. It's definitely usable, but sometimes you're waiting a bit, though I'm not finding it problematic for the most part. I can run Qwen3-coder-next (80B MoE) at 36 tok/sec - hoping they release a Qwen3.6-coder soon. |
| |
▲ | lambda 12 minutes ago | parent | next [-] | | That sounds high for a Strix Halo with a dense 27B model. Are you talking about prefill (prompt eval, which can happen in parallel) or generation when you quote tokens per second? Usually if people quote only one number they're quoting generation speed, and I would be surprised if you got that for generation speed on a Strix Halo. | |
| ▲ | bityard 2 hours ago | parent | prev | next [-] | | I have a Framework Desktop too and 20-25 t/s is a lot better than I was expecting for such a large dense model. I'll have to try it out tonight. Are you using llama.cpp? | | |
| ▲ | UncleOxidant an hour ago | parent [-] | | LMStudio, but it uses llama.cpp to run inference, so yeah. This is with the vulkan backend, not ROCm. |
| |
▲ | petu an hour ago | parent | prev [-] | | > Qwen3.5-27b 8-bit quant 20 to 25 tok/sec Is that with some kind of speculative decoding? Or total throughput for parallel requests? |
|
|
| ▲ | ekojs 3 hours ago | parent | prev | next [-] |
| As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run this on a 3090/4090/5090. You can probably even go FP8 with a 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding, you can get even faster (2-3x I'd say). Pretty amazing what you can get locally. |
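A minimal sketch of that speculative-decoding setup with llama-server (the draft model path is a placeholder; it should be a much smaller model from the same family that shares the same tokenizer):

    llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
      -ngl 99 \
      --model-draft ./qwen3.6-draft-small-q8_0.gguf \
      --draft-max 16 --draft-min 1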
| |
| ▲ | Aurornis 2 hours ago | parent | next [-] | | > As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless The 4-bit quants are far from lossless. The effects show up more on longer context problems. > You can probably even go FP8 with 5090 (though there will be tradeoffs) You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically it would be Q5 on a 32GB card to fit context lengths needed for anything other than short answers. | | |
▲ | alex7o 11 minutes ago | parent | next [-] | | Turboquant on 4-bit helps a lot as well for keeping context in VRAM, but int4 is definitely not lossless. But it all depends; for some people this is sufficient. | |
| ▲ | ekojs 2 hours ago | parent | prev [-] | | > You cannot run these models at 8-bit on a 32GB card because you need space for context You probably can actually. Not saying that it would be ideal but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV cache quantization and not loading the vision tower would help quite a bit. Not ideal for long context, but it should be very much possible. I addressed the lossless claim in another reply but I guess it really depends on what the model is used for. For my usecases, it's nearly lossless I'd say. |
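A sketch of the relevant llama.cpp knobs (the quant tag and context length are illustrative, and quantizing the V cache requires flash attention to be enabled):

    # Quantize the KV cache and skip the vision projector to free up VRAM.
    llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 \
      -ngl 99 --flash-attn on \
      -ctk q8_0 -ctv q8_0 \
      --no-mmproj -c 32768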
| |
| ▲ | zozbot234 2 hours ago | parent | prev | next [-] | | 4-bit quantization is almost never lossless especially for agentic work, it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters that's been quantized with more precision. | | |
▲ | ekojs 2 hours ago | parent [-] | | Yeah, I figure the 'nearly lossless' claim is the most controversial thing. But in my defense, ~97% recovery in benchmarks is what I consider 'nearly lossless'. When quantized with calibration data for a specialized domain, the difference in my internal benchmark is pretty much indistinguishable. But for agentic work, 4-bit quants can indeed fall a bit short in long-context use cases, especially if you quantize the attention layers. |
| |
| ▲ | binary132 2 hours ago | parent | prev [-] | | That seems awfully speculative without at least some anecdata to back it up. | | |
| ▲ | arcanemachiner 2 hours ago | parent | next [-] | | Sure, go get some. This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time. Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it. | |
| ▲ | ekojs 2 hours ago | parent | prev [-] | | Not at all, I actually run ~30B dense models for production and have tested out 5090/3090 for that. There are gotchas of course, but the speed/quality claims should be roughly there. |
|
|
|
| ▲ | xngbuilds 2 hours ago | parent | prev | next [-] |
| For Apple Mac, there is https://omlx.ai/benchmarks |
|
| ▲ | chrsw 3 hours ago | parent | prev | next [-] |
| These might help if the provider doesn't offer the same details themselves. Of course, we have to wait for the newly released models to get added to these sites. https://llmfit.io/ https://modelfit.io/ |
|
| ▲ | rubiquity 2 hours ago | parent | prev | next [-] |
| At 8-bit quantization (q8_0) I get 20 tokens per second on a Radeon R9700. |
|
| ▲ | random3 2 hours ago | parent | prev | next [-] |
| Check out https://www.canirun.ai/ (and https://news.ycombinator.com/item?id=47363754) |
| |
▲ | esskay an hour ago | parent [-] | | CaniRun's not a great tool - look how long it's been since it was updated. It's not got any of the Qwen3.6 models on the list, nor the new Kimi one. In fact it's missing many of the "popular" models. |
|
|
| ▲ | jjcm 3 hours ago | parent | prev | next [-] |
| Fwiw, huggingface does this on the page where you download the weights. Slightly different format though - you put all the hardware you have, and it shows which quants you can run. |
|
| ▲ | neonstatic 21 minutes ago | parent | prev | next [-] |
| I just ran Unsloth's Q4_K_M quant on a MacBook Pro M1 Max (2021), 32 GB RAM. Getting about 9 tok/s. Reported RAM usage 15.9 GB. It's okay, but a bit slow for my taste. An example of a query and a response is below. It was done at 9.09 tok/s, 893 tokens, 7.56 seconds to first token; it thought for 1 minute and 9 seconds. Question: I gave the model this truth table and asked it to tell me what boolean function would produce the output given A, B as inputs. | A | B | Output |
| :--- | --- | ---: |
| (T) | (F) | (T) |
| (T) | (T) | (F) |
| (F) | (F) | (F) |
| (F) | (T) | (F) |
Answer: (note this is part of a larger conversation, where I was getting it to help me with some boolean logic): The boolean condition for your table is simply:
```rust
A && !B
```
### Why?
Your table shows the output is `true` **only** when `A` is true and `B` is false. That's exactly what `A && !B` expresses.
| A | B | `A && !B` | Output |
|---|---|-----------|--------|
| (T) | (F) | `T && T = T` | (T) |
| (T) | (T) | `T && F = F` | (F) |
| (F) | (F) | `F && T = F` | (F) |
| (F) | (T) | `F && F = F` | (F) |
### In Rust:
```rust
if A && !B {
// output is true
} else {
// output is false
}
```
This is the most direct and idiomatic way to express that logic. Let me know if you need it adapted back to your `live_ticks` / `on_disconnect` context!
|
|
| ▲ | scosman 37 minutes ago | parent | prev | next [-] |
| this is what the unsloth post is for |
|
| ▲ | arcanemachiner 2 hours ago | parent | prev | next [-] |
| Divide the value before the B by 2, and there's your answer if you get a Q4_K_M quant. Plus a bit of room for KV cache. TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant. Tokens per second is an unreasonable ask since every card is different: are you using GGUF or not, CUDA or ROCm or Vulkan or MLX, what optimizations are in your version of the inference software, what flags you are running, etc. Note that it's a dense model (the Qwen models have another value at the end of MoE model names, e.g. A3B), so it will not run very well in RAM, whereas with a MoE model you can spill over into RAM if you don't have enough VRAM and still have reasonable performance. Using these models requires some technical know-how, and there's no getting around that. |
|
| ▲ | CamperBob2 an hour ago | parent | prev | next [-] |
| 27B dense is not a screamer, even on an RTX 6000, but it will run at full precision with (more than) enough room for context at the model's own capacity. You can expect about 30 tokens/second after prompt processing. Quants will likely run similarly well on the 16/24/32 GB consumer GPUs. The 3.5 27B model was a strong and capable reasoner, so I have high hopes for this one. Thanks to the team at Qwen for keeping competition in this space alive. |
|
| ▲ | underlines 3 hours ago | parent | prev | next [-] |
| depends on format, compute type, quantization and kv cache size. |
| |
| ▲ | mottosso 3 hours ago | parent [-] | | Specs for whatever they used to achieve the benchmarks would be a good start. | | |
| ▲ | bityard 2 hours ago | parent | next [-] | | The benchmarks in the model card are purported to be measurements of model quality (ability to perform tasks with few errors), not speed. They almost certainly run these benchmarks on their own cloud infrastructure (Alibaba afaik), which is typically not hardware that even the most enthusiastic homelab hobbyist can afford. | |
| ▲ | Aurornis 3 hours ago | parent | prev [-] | | The benchmarks are from the unquantized model they release. This will only run on server hardware, some workstation GPUs, or some 128GB unified memory systems. It’s a situation where if you have to ask, you can’t run the exact model they released. You have to wait for quantizations to smaller sizes, which come in a lot of varieties and have quality tradeoffs. | | |
|
|
|
| ▲ | jauntywundrkind 2 hours ago | parent | prev | next [-] |
| I would detest the time/words it would take to hand-hold through such a review, teaching folks the basics about LLMs like this. It's also a section that, with hope, becomes obsolete sometime semi-soonish. |
|