varispeed 6 hours ago

Do these benchmarks make any sense? I tried a few local models that score well on SWE-bench, but the results were pure rubbish. (For instance MiniMax-M2.5 at 128GB from Unsloth - completely unusable.)

devnotes77 6 hours ago | parent | next [-]

SWE-bench measures the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.

segmondy 5 hours ago | parent | prev [-]

Which quant? I find folks running lower quants complaining when they should be running a higher quant. Qwen3CoderNext is great, even at Q6. I mistakenly had it loaded for an agentic workflow and was surprised at how well it did.

code_biologist 5 hours ago | parent [-]

What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".

[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...
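For anyone lost on the naming: the suffix roughly encodes bits per weight, which directly sets file size (and VRAM footprint). A minimal sketch, using approximate community-reported bits-per-weight figures for llama.cpp GGUF quant types (the exact numbers vary slightly by model architecture):

```python
# Rough bits-per-weight for common llama.cpp GGUF quant types.
# These are approximate averages, not exact per-model values.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.50,
}

def approx_file_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a given parameter count and quant type."""
    total_bits = BITS_PER_WEIGHT[quant] * n_params_billion * 1e9
    return total_bits / 8 / 1e9

# Example: a hypothetical 32B-parameter model at two quant levels.
for q in ("Q4_K_M", "Q6_K"):
    print(q, round(approx_file_size_gb(32, q), 1), "GB")
```

So for a 32B model, Q4_K_M lands around 19 GB while Q6_K is around 26 GB; the ~35% size increase is the price of the extra precision people recommend for agentic work.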