redhale · 8 hours ago
I agree with your take. I don't really see why evals are assumed to be exclusively the domain of data scientists. In my experience, SWEs-turned-AI-Engineers are much better suited to building agents. Some struggle more than others, but "evals as automated tests" is, imo, such an obvious mental model, and one so readily adopted by good SWEs, that data scientists have no real role on many "agent" projects. I'm not saying this is good or bad, just that it's what I'm observing in practice. For context, I'm a SWE-turned-AI-Engineer, so I may be biased :)
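To make the "evals as automated tests" mental model concrete, here's a minimal sketch. Everything in it is hypothetical: `run_agent` stands in for a real agent call, and the cases and threshold are made up. The key difference from a unit test is asserting on an aggregate metric rather than exact per-case output:

```python
def run_agent(prompt: str) -> str:
    # Placeholder for the real agent call (e.g. an LLM API request).
    return "Paris" if "France" in prompt else "unknown"

# A tiny, hypothetical eval set of (input, expected) pairs.
EVAL_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Atlantis?", "unknown"),
]

def test_agent_accuracy():
    passed = sum(run_agent(q) == expected for q, expected in EVAL_CASES)
    accuracy = passed / len(EVAL_CASES)
    # Unlike a classic unit test, an eval usually asserts that an
    # aggregate score clears a threshold, since some per-case
    # failures are expected with a nondeterministic model.
    assert accuracy >= 0.9, f"accuracy {accuracy:.2f} below threshold"

test_agent_accuracy()
```

In a real project this would run under pytest/CI like any other test suite, which is exactly why the model feels natural to SWEs.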
samusiam · 7 hours ago
I think there's a lot of methodological expertise that goes into collecting good eval data. In many cases you need human labelers with the right expertise, well-designed tasks, and well-defined constructs, and you need to hit inter-rater agreement targets and troubleshoot when you don't. Good label data is a prerequisite for the stuff that can probably be automated by the AI agent (improving the system to optimize a metric measured against ground-truth labels). Data scientists and research scientists are more likely to have this skill set, and it takes time to pick up and learn the nuances.
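For anyone unfamiliar with the inter-rater agreement part: the standard trick is to correct raw agreement for chance. A minimal sketch of Cohen's kappa for two labelers (the labels here are made up; real pipelines would typically use something like sklearn's `cohen_kappa_score`):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Raw (observed) agreement rate.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical labelers judging the same 10 agent outputs.
rater_1 = ["pass", "pass", "fail", "pass", "fail",
           "pass", "pass", "fail", "pass", "pass"]
rater_2 = ["pass", "pass", "fail", "fail", "fail",
           "pass", "pass", "fail", "pass", "fail"]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # prints "kappa = 0.60"
```

When kappa comes in low, that's the troubleshooting signal: the task definition, rubric, or labeler training needs work before the labels are trustworthy as ground truth.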