terhechte · 5 hours ago:
Thank you for NeoVim! I also use it every day, mostly for thinking, notes, and Markdown these days. Have you compared against MLX? Sometimes I'm getting much faster responses, but it feels like the quality is worse (e.g., tool calls not working).
tarruda · 4 hours ago:
> Have you compared against MLX?

I don't think MLX supports similar 2-bit quants, so I never tried the 397B model with MLX. However, I did try 4-bit MLX with other Qwen 3.5 models, and yes, it is significantly faster. I still prefer llama.cpp because it is an all-in-one package:

- SOTA dynamic quants (especially ik_llama.cpp)
- an amazing web UI with MCP support
- Anthropic/OpenAI-compatible endpoints (so it can be used with virtually any harness)
- JSON-constrained output, which basically ensures tool-call correctness (see the sketch after this list)
- routing mode
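As a rough illustration of the last two points, here is a minimal Python sketch of a schema-constrained request against llama-server's OpenAI-compatible endpoint. It assumes a llama-server instance running on the default localhost:8080 and OpenAI-style `json_schema` support in `response_format`; the schema, model name, and message content are made up for the example.

```python
# Minimal sketch: POST to llama-server's OpenAI-compatible chat endpoint
# with a JSON schema, so the grammar-constrained decoder can only emit
# output matching the schema. Assumes llama-server on localhost:8080
# (the default port); the "tool_call" schema below is hypothetical.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # placeholder; llama-server serves whatever model it loaded
        "messages": [
            {"role": "user", "content": "Call the weather tool for Berlin."}
        ],
        # OpenAI-style structured output; llama.cpp maps this to a grammar
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "tool_call", "schema": schema},
        },
    },
)

# The content is guaranteed to parse as JSON matching the schema,
# which is what makes tool calls reliable even with small quants.
print(json.loads(resp.json()["choices"][0]["message"]["content"]))
```

Because the constraint is enforced at sampling time rather than by prompting, malformed tool calls are ruled out structurally; any OpenAI- or Anthropic-compatible harness can point at the same endpoint unchanged.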