Paddyz · 2 hours ago

The 35b-a3b naming is misleading: it's a MoE with only 3B active parameters per forward pass. You're essentially getting 3B-class inference quality while paying the memory cost of loading all 35B parameters. That's why it feels so much worse than Opus or Gemini, which are likely 10-100x larger in effective compute per token.

For your M3 Max 128G setup, try Qwen3.5-122B-A10B with a 4-bit quantization instead (should fit in ~50-60GB). 10B active params is a massive step up from 3B, and you'll actually see the quality difference people are talking about. MLX builds optimized for Apple Silicon will also give you noticeably better tok/s than running through ollama.

The general rule I've settled on: MoE models with <8B active params are great for structured tasks (reformatting, classification, simple completions) but fall apart on anything requiring deep reasoning or domain knowledge. For your research-question use case, you want either a dense 27B+ model or a MoE with 10B+ active params.
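As a rough sanity check on the "fits in ~50-60GB" figure: weight memory for a quantized model is approximately total parameters times bits per weight, divided by 8. This is a back-of-the-envelope sketch only (the `quantized_weight_gb` helper is hypothetical); real quants often use more effective bits per weight for scales/zero-points, and the KV cache and runtime overhead come on top.

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (decimal) for a quantized model.

    Ignores quantization scale/zero-point overhead and KV cache,
    so treat the result as a lower bound on real memory use.
    """
    return total_params * bits_per_weight / 8 / 1e9

# 122B total params at 4-bit: ~61 GB of weights alone,
# in the same ballpark as the ~50-60GB estimate above.
print(round(quantized_weight_gb(122e9, 4), 1))  # 61.0

# The 35B MoE at 4-bit still needs ~17.5 GB resident,
# even though only 3B params are active per token.
print(round(quantized_weight_gb(35e9, 4), 1))  # 17.5
```

This is also why MoE memory cost scales with total params, not active params: every expert must be resident so the router can pick among them each token.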