sudhirb 4 days ago:
For coding agents, evaluations are tricky: thorough evaluation tasks tend to be slow, expensive, and/or highly variable across N attempts. You could run a whole benchmark like SWE Bench or Terminal Bench against a coding agent on every change, but that quickly becomes infeasible.
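A rough back-of-the-envelope for the variance point, assuming a benchmark scored as a pass rate over independent task attempts (the numbers are illustrative, not from the thread):

    # Sketch of why per-change benchmark runs get expensive: with a pass-rate
    # metric, sampling noise over N attempts can easily swamp the effect size
    # you care about. Numbers are purely illustrative.
    import math

    def pass_rate_stderr(p: float, n: int) -> float:
        """Standard error of an observed pass rate p over n independent attempts."""
        return math.sqrt(p * (1 - p) / n)

    def attempts_needed(p: float, detectable_delta: float, z: float = 1.96) -> int:
        """Attempts needed so the 95% CI half-width shrinks below the delta."""
        return math.ceil((z ** 2) * p * (1 - p) / detectable_delta ** 2)

    # e.g. a ~50% pass rate and a 2-point improvement you want to detect:
    print(pass_rate_stderr(0.5, 500))   # ~0.022 -> noise comparable to the delta
    print(attempts_needed(0.5, 0.02))   # ~2401 task attempts per candidate change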
roadside_picnic 4 days ago:
I used to own the eval suite for a coding agent; it's certainly doable, even when it requires SQL, tables, etc. We had support for a wide range of data options, from canned CSV data to plugging into prod to simulate the user experience, all easily configurable at eval run time. It also supported agentic flows where the results from one eval could be chained into the next (with a known correct answer optionally substituted to check the framework end to end in the case of node failure).

Interestingly, we started with hundreds of evals, but after that experience my advice has become: fewer evals, tied more closely to specific features and product ambitions. By that I mean some evals should serve as a warning ("uh oh, that eval failed, don't push to prod"), others as a milestone ("woohoo! we got it to work!"), and all should be informed by the product roadmap. You should basically be able to tell where the product is going just by looking over the eval suite. And if you don't have evals, you really don't know if you're moving the needle at all. There were multiple situations where a tweak to a prompt passed an initial vibe check but, when run against the full eval suite, clearly performed worse.

The other piece of advice: evals don't have to be sophisticated, just repeatable and agnostic to who's running them. Heck, even "vibe checks" can be good evals if they're written down and have to pass some consensus among multiple people on whether they passed or not.
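A minimal sketch of the shape described above, not the actual framework; the field and function names are assumptions for illustration:

    # Hypothetical eval suite: each eval is tied to a feature, gated as a
    # blocker or a milestone, picks its data source at run time, and can chain
    # onto an upstream eval's output with a known-good fallback.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Eval:
        name: str
        feature: str                            # ties the eval to a roadmap item
        gate: str                               # "blocker" or "milestone"
        data_source: str                        # e.g. "canned_csv" or "prod_replica"
        check: Callable[[str], bool]            # repeatable pass/fail judgement
        depends_on: Optional[str] = None        # chain onto a prior eval's output
        known_good_input: Optional[str] = None  # fallback if the upstream eval fails

    def run_suite(evals: list[Eval], agent: Callable[[str, str], str]) -> dict[str, bool]:
        results: dict[str, bool] = {}
        outputs: dict[str, str] = {}
        for ev in evals:
            upstream = ""
            if ev.depends_on:
                # If the upstream eval failed, substitute the known correct answer
                # so the rest of the chain is still exercised end to end.
                if results.get(ev.depends_on) and ev.depends_on in outputs:
                    upstream = outputs[ev.depends_on]
                elif ev.known_good_input is not None:
                    upstream = ev.known_good_input
            output = agent(ev.data_source, upstream)
            outputs[ev.name] = output
            results[ev.name] = ev.check(output)
        failed_blockers = [e.name for e in evals if e.gate == "blocker" and not results[e.name]]
        if failed_blockers:
            print("don't push to prod:", failed_blockers)
        return results

The point of the sketch is that scanning the list of Eval definitions should read like the product roadmap, and the gate field is what turns a failing eval into either a release blocker or a milestone tracker.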
criemen 3 days ago:
Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset. https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and, by extension, anything based on public source code) most likely overestimates your agent's actual capabilities.
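One partial mitigation people reach for is restricting eval tasks to issues or commits created after the model's training cutoff. A minimal sketch, assuming tasks carry a creation timestamp; the cutoff date and field names are illustrative, and date filtering alone doesn't fully rule out contamination:

    # Hypothetical contamination filter: keep only tasks created after the
    # model's training cutoff. Field names and the cutoff are assumptions.
    from datetime import datetime, timezone

    TRAINING_CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)  # illustrative

    def is_plausibly_uncontaminated(task: dict) -> bool:
        created = datetime.fromisoformat(task["created_at"])
        return created > TRAINING_CUTOFF

    tasks = [
        {"id": "repo#101", "created_at": "2024-06-01T00:00:00+00:00"},
        {"id": "repo#245", "created_at": "2025-03-15T00:00:00+00:00"},
    ]
    fresh = [t for t in tasks if is_plausibly_uncontaminated(t)]
    print([t["id"] for t in fresh])  # ['repo#245']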