Aurornis (3 hours ago):

If you're new to this: all of the open-source models are playing benchmark-optimization games. Every new open-weight model arrives promising to be as good as something SOTA from a few months ago, and then it disappoints in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since each was released. They are impressive, but in my experience they are not performing at Sonnet 4.5 level.

I have observed that they're configured to be very tenacious. If you carefully constrain the goal with some tests they need to pass, and frame it in a way that keeps them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems the way a broken clock is right twice a day, but there's a lot of fumbling to get there.

That said, they are impressive for open-source models. It's amazing what you can do self-hosted now. Just don't believe the hype that these are Sonnet 4.5-level models, because you're going to be very disappointed once you get into anything complex.
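The "constrain the goal with tests" workflow described above can be sketched as a simple retry loop. `generate` and `run_tests` here are hypothetical stand-ins (mocked below so the sketch runs standalone); only the loop structure is the point:

```python
def solve_with_retries(generate, run_tests, max_attempts=5):
    """Keep asking the model for a candidate until the tests pass.

    generate(feedback)   -> a candidate solution (hypothetical model call)
    run_tests(candidate) -> (passed: bool, feedback: str)
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            return candidate, attempt
    return None, max_attempts


# Toy demo with a mock "model" that eventually gets it right:
attempts = iter(["wrong", "still wrong", "42"])

def mock_generate(feedback):
    return next(attempts)

def mock_tests(candidate):
    return (candidate == "42", "expected 42, got " + candidate)

result, n = solve_with_retries(mock_generate, mock_tests)
print(result, n)  # -> 42 3
```

The loop is exactly the "broken clock" dynamic: with a hard pass/fail oracle, even an erratic generator eventually lands on an answer, at the cost of wasted attempts.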
kir-gadjello (2 hours ago):

Respectfully, from my experience and a few billion tokens consumed, some open-source models really are strong and useful. Specifically StepFun-3.5-flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a fairly complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and StepFun powers through. I have no relation to StepFun; I'm saying this purely out of deep respect for the team that managed to pack this performance into a 196B/11B-active envelope.
wolvoleo (2 hours ago):

All models are doing that, not only the open-source ones. I'd bet the cloud models do it a lot more, because they can also control the runtime side, which the open-source ones can't.
dimgl (29 minutes ago):

I'm using Qwen 3.5 27B on my 4090, and let me tell you: this is the first time I've been seriously blown away by the coding performance of a local model. They're almost always unusable. Not this time, though.
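For anyone wanting to try a similar setup: a common way to serve a local model like this is an OpenAI-compatible server such as llama.cpp's `llama-server`. The `.gguf` filename and context size below are placeholders, not a verified config for this exact model:

```shell
# Serve a local GGUF model with an OpenAI-compatible API (llama.cpp).
# The model filename is a placeholder; -ngl 99 offloads all layers to the GPU,
# -c sets the context window.
llama-server -m qwen3.5-27b-instruct-q4_k_m.gguf -ngl 99 -c 8192 --port 8080

# Then point any OpenAI-style client at the local endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a binary search in Rust."}]}'
```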
chaboud (2 hours ago):

"When a measure becomes a target, it ceases to be a good measure." Goodhart's law shows up with people, in system design, in processor design, in education... Models are going to be overfit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.
rudhdb773b (an hour ago):

Are there any up-to-date offline/private agentic-coding benchmark leaderboards? If the tests haven't been published anywhere and are sufficiently different from standard problems, the benchmarks should be robust to intentional over-optimization.

Edit: These look decent and generally match my expectations:
noosphr (2 hours ago):

It's not just the open-source ones. The only benchmarks worth anything are dynamic ones that can be scaled up.
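One way to read "dynamic" here is a benchmark that generates fresh task instances from a seed, so there is no fixed test set to memorize or overfit. A minimal sketch (the arithmetic task format is made up for illustration):

```python
import random

def make_task(seed):
    """Generate a fresh task instance; unseen seeds cannot be memorized."""
    rng = random.Random(seed)
    a = rng.randrange(100, 1000)
    b = rng.randrange(100, 1000)
    return f"Compute {a} * {b}.", str(a * b)

def grade(model, seeds):
    """Score a model callable on freshly generated instances."""
    correct = 0
    for seed in seeds:
        prompt, answer = make_task(seed)
        if model(prompt).strip() == answer:
            correct += 1
    return correct / len(seeds)

# Demo with a "model" that actually does the arithmetic:
def oracle(prompt):
    _, a, _, b = prompt.rstrip(".").split()
    return str(int(a) * int(b))

score = grade(oracle, range(100))
print(score)  # -> 1.0
```

Because tasks are derived from seeds, the evaluator can scale the benchmark up arbitrarily (or rotate seeds per release), which is exactly what a static, published test set can't do.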
amelius (3 hours ago):

Are you saying the benchmarks are flawed? And could quantization perhaps partially explain the worse-than-expected results?
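Quantization could plausibly account for part of the gap: most local deployments run 4-bit weights, and rounding to a coarse grid loses precision relative to the full-precision weights the benchmarks were likely run with. A toy illustration of symmetric round-to-nearest quantization (not any specific real scheme such as GGUF's K-quants):

```python
def fake_quantize(values, bits=4):
    """Round values to a symmetric int grid and back (round-trip error demo)."""
    qmax = 2 ** (bits - 1) - 1          # 7 levels each side for 4-bit
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale

weights = [0.91, -0.33, 0.05, 0.47, -0.78]
dequant, scale = fake_quantize(weights, bits=4)
errors = [abs(w - d) for w, d in zip(weights, dequant)]
print(max(errors) <= scale / 2)  # rounding error is bounded by half a step
```

With 4 bits the grid step (`scale`) is large, so every weight can be perturbed by up to half a step; whether that visibly hurts coding performance depends on the model and scheme, but it is a real source of degradation separate from benchmark gaming.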
| |||||||||||||||||||||||||||||||||||||||||||||||||||||
eurekin (2 hours ago):

Very good point. I'm playing with them too and came to the same conclusion.
crystal_revenge (2 hours ago):

> they always disappoint in actual use.

I've switched to Kimi 2.5 for all of my personal usage and am far from disappointed. Aside from being much cheaper than the big names (yes, I'm not running it locally, but I like that I could), it just works and isn't a sycophant. It's nice to get coding problems solved without any "That's a fantastic idea!" / "great point" comments.

At least with Kimi, my understanding is that beating benchmarks was a secondary goal behind good developer experience.
jackblemming (3 hours ago):

Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too: eventually, people will notice models aren't actually getting better, and the money will stop flowing. However, this might be a golden age for research, as cheap GPUs flood the market and universities get their own clusters.
bourjwahwah (2 hours ago):

[dead]