Damianf19 4 hours ago

What's the data model that lets you compare agents that differ a lot in tools/policies? Curious if you normalize on the "what did the user actually accomplish" layer or on raw token/turn metrics, because the two paint completely different pictures of "is this agent working." We struggle with this on the eval side of our own product (email pipeline outcomes, not agents, but same shape).

alrudolph 4 hours ago

For whether the agent is working, we're focusing on the user outcome. We think the raw usage, number of turns, and function calls are useful operationally, but we treat those as observability rather than the core evaluation target. We do show some of these stats in our conversation view but don't aggregate them to compare agents. Longer term we'll look to add more of these features so we can compare quality vs. cost, for example.