▲ AI2: Open Coding Agents (allenai.org)
46 points by publicmatt 4 hours ago | 13 comments
▲ ahmadyan an hour ago | parent | next [-]
Claims in the article are incorrect. They conveniently ignore Meta's CWM models, which are open-source [1] and open-weight [2], hit 65% on SWE-bench Verified (with TTS) and 54% pass@1, and are the same size (32B dense). So making claims like "surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths" while leaving the previous OSS SOTA out of your eval tables is ... sketchy.

[1] https://github.com/facebookresearch/cwm
[2] https://huggingface.co/facebook/cwm
▲ augusteo 41 minutes ago | parent | prev | next [-]
The ahmadyan comparison is fair: Meta's CWM models hitting 65% vs SERA's 54% is a meaningful gap. But the interesting number here isn't accuracy, it's the $400 to reproduce top open-source performance. That's the part that matters for teams building internal tooling.

We've been running agents on proprietary codebases at work. The pain isn't model quality, it's customization. Most off-the-shelf agents don't understand your repo structure, your conventions, your test patterns. If you can fine-tune a 32B model on your own codebase for a few hundred dollars, that changes the economics completely (a rough sketch of the data-prep side is below). But codebases change every day, so the fine-tuning would have to be redone continuously. Probably not worth it versus something like Claude Code.

Curious whether anyone's tried this on non-Python codebases. Most SWE-bench stuff is Python-heavy.
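For concreteness, a minimal sketch of the data-prep step for fine-tuning on your own repo. Everything here is an illustrative assumption (the prompt/completion JSONL format, the chunk size, the Python-only filter), not AI2's actual recipe:

    # Walk a repo and emit prompt/completion pairs as JSONL for SFT.
    # Format and chunking are assumptions, not any released pipeline.
    import json
    import pathlib

    CHUNK_LINES = 80  # rough chunk size; tune to your context window

    def repo_to_jsonl(repo: pathlib.Path, out: pathlib.Path) -> None:
        with out.open("w") as f:
            for path in sorted(repo.rglob("*.py")):
                lines = path.read_text(errors="ignore").splitlines()
                for i in range(0, len(lines), CHUNK_LINES):
                    f.write(json.dumps({
                        # Hypothetical prompt: file header plus preceding chunk.
                        "prompt": f"# File: {path.relative_to(repo)}\n"
                                  + "\n".join(lines[max(0, i - CHUNK_LINES):i]),
                        "completion": "\n".join(lines[i:i + CHUNK_LINES]),
                    }) + "\n")

    repo_to_jsonl(pathlib.Path("."), pathlib.Path("repo_sft.jsonl"))

The continuous-retraining objection still stands: this whole file has to be regenerated and retrained on every time the repo drifts.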
▲ nickandbro 38 minutes ago | parent | prev | next [-]
Great work! Really respect AI2: they open-source everything. The model, the weights, the training pipeline, the inference stack, and the corpus.
▲ Imustaskforhelp 27 minutes ago | parent | prev | next [-]
Hey, this looks great! Is it available on OpenRouter? I wish AI2 would release a denser model than the 8B one for free on OpenRouter; I was using the Devstral model for agentic purposes. If we can get a good agentic ~32B model on OpenRouter for roughly free, I feel like it will be very interesting to see how things go.

Good luck, AI2! The premise of truly open-source models is really interesting, and I feel like it could help bring more innovation to the space.
▲ jauntywundrkind an hour ago | parent | prev | next [-]
Awesome stuff. Output speed looks crazy fast too.

I wonder if this will indeed start prompting more language-specific work. AFAIK training still requires not just looking at sample code but also being able to write loss functions and to have problems the AI can work on. That seems hard.

One random thought: are there training styles that just delete some code from "good" projects and then make the AI get it working again? (Rough sketch of what that could look like below.)
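Something in that spirit is easy to prototype as a data-synthesis pattern: blank out one function body in a known-good file, keep the original as the reference, and let the project's tests judge the restoration. A minimal sketch, with the repo path and task schema made up for illustration; a real pipeline would also verify the tests fail on the masked version and pass on the original:

    # Turn known-good code into "restore the deleted part" tasks.
    # Repo path and task schema are illustrative assumptions.
    import ast
    import json
    import pathlib

    def mask_function(source: str, func_name: str) -> str:
        """Replace the body of func_name with `raise NotImplementedError`."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                node.body = [ast.Raise(
                    exc=ast.Call(
                        func=ast.Name(id="NotImplementedError", ctx=ast.Load()),
                        args=[], keywords=[]),
                    cause=None)]
        return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+

    def make_tasks(repo: pathlib.Path):
        for path in sorted(repo.rglob("*.py")):
            source = path.read_text(errors="ignore")
            for node in ast.parse(source).body:
                if isinstance(node, ast.FunctionDef):
                    yield {
                        "file": str(path.relative_to(repo)),
                        "prompt": mask_function(source, node.name),
                        "reference": source,  # what the model should recover
                        "target": node.name,
                    }

    for task in make_tasks(pathlib.Path("some-good-project")):
        print(json.dumps(task)[:100], "...")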
▲ khimaros an hour ago | parent | prev [-]
It's great to see this kind of progress in reproducible weights, but color me confused: this claims to be better and smaller than Devstral-Small-2-24B, while clocking in at 32B (larger) and scoring worse?