| ▲ | throwdbaaway 3 hours ago |
| There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16GB of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
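| A quick back-of-the-envelope check on that 16GB figure (a minimal sketch; the bits-per-weight values below are assumptions typical of ~4-bit GGUF quants, and KV cache needs memory on top of the weights):

```python
# Back-of-the-envelope VRAM estimate for a quantized dense model.
# Assumed: effective bits-per-weight values typical of ~4-bit GGUF quants;
# KV cache and activations need additional memory on top of this.

def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight storage in GiB at a given effective bits per weight."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for bpw in (4.0, 4.5, 5.0):
    print(f"27B at {bpw} bpw ≈ {weight_gib(27, bpw):.1f} GiB of weights")

# 27B at 4.0 bpw ≈ 12.6 GiB, at 4.5 ≈ 14.1 GiB, at 5.0 ≈ 15.7 GiB,
# so a ~4 bpw quant leaves some headroom for KV cache on a 16GB card.
```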
|
| ▲ | codemog 2 hours ago | parent | next [-] |
| Can someone explain how a 27B model (quantized, no less) can ever be comparable to a model like Sonnet 4.0, which is likely in the mid to high hundreds of billions of parameters? Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
| |
| ▲ | revolvingthrow 11 minutes ago | parent | next [-] | | It doesn’t. I’m not sure it outperforms ChatGPT 3. | |
| ▲ | spwa4 10 minutes ago | parent | prev | next [-] | | The short answer is that there are more things that matter than parameter count, and we are probably nowhere near the most efficient way to make these models. | |
| ▲ | otabdeveloper4 an hour ago | parent | prev [-] | | There are strongly diminishing returns as you increase parameter count. The sweet spot isn’t in the "hundreds of billions" range; it’s much lower than that. Anyway, your perception of a model’s "quality" is determined by careful post-training. | | |
| ▲ | codemog 43 minutes ago | parent | next [-] | | Interesting. I see papers where researchers fine-tune models in the 7B to 12B range and beat, or at least stay competitive with, frontier models. I wish I knew how this was possible, or had more intuition about such things. If anyone has paper recommendations, I’d appreciate it. | |
| ▲ | zozbot234 an hour ago | parent | prev [-] | | More parameters improve general knowledge a lot, but you then have to quantize more aggressively to fit into a given amount of memory, which, taken to extremes, leads to erratic behavior. For casual chat use even Q2 models can be compelling; agentic use needs more reliable behavior, which means less aggressive quantization and a lower total parameter count to compensate. |
|
|
|
| ▲ | zozbot234 2 hours ago | parent | prev | next [-] |
| With MoE models, if the complete weights almost fit in RAM, you can rely on mmap and the weights for inactive experts will be streamed from disk when needed. There’s obviously a slowdown, but it’s quite gradual, and even less noticeable with fast storage.
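| A minimal sketch of what that looks like in practice, assuming llama-cpp-python as the front end (the GGUF filename and the parameter values here are placeholders):

```python
# Minimal sketch of relying on mmap for a MoE model that doesn't fully fit
# in RAM, using llama-cpp-python. The model filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-moe-q4.gguf",  # placeholder GGUF file
    n_ctx=8192,
    n_gpu_layers=20,   # offload some layers; the rest stays in system memory
    use_mmap=True,     # memory-map the weights (llama.cpp's default), so
                       # expert tensors not resident in RAM are paged in
                       # from disk only when the router actually needs them
    use_mlock=False,   # don't pin pages, letting the OS evict cold experts
)

out = llm("Explain mixture-of-experts routing in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```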
|
| ▲ | teaearlgraycold 2 hours ago | parent | prev [-] |
| Qwen3.5 35B A3B is much, much faster and fits if you get a 3-bit version. How fast are you getting the 27B to run? On my M3 Air with 24GB of memory, the 27B runs at 2 tok/s, but 35B A3B gets 14-22 tok/s, which is actually usable.
| |
| ▲ | throwdbaaway 33 minutes ago | parent | next [-] | | Using ik_llama.cpp to run a 27B 4bpw quant on an RTX 3090, I get 1312 tok/s prompt processing (PP) and 40.7 tok/s text generation (TG) at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context. 35B A3B is faster, but it didn’t do too well in my limited testing. | |
| ▲ | ece 2 hours ago | parent | prev [-] | | The 27B is rated slightly higher for SWE-bench. |
|