derbOac 15 hours ago

"Quality metrics" need much more discussion and attention, in my opinion.

Not a criticism of this project — it's a good idea; it just highlights the central question of "how well is this model working?" I'm not sure it's so straightforward.

wonderwhyer 13 hours ago | parent

I agree! My "dream" way to do it is closer to how the Aider Leaderboard works, but even a bit better: have a GDPEval-like set of tasks, but with information across all tasks and all models about how much time/tokens/money/quality you get from a particular model on a particular task. I was thinking of running evals against skills in that sense.
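To make that concrete, here's a rough sketch of what such a per-task, per-model table and a simple "pick the best model within budget" query could look like. All names, numbers, and the schema here are made up for illustration, not from any existing leaderboard:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    # One row per (model, task) pair -- hypothetical schema
    model: str
    task: str
    seconds: float    # wall-clock time to complete the task
    tokens: int       # total tokens consumed
    cost_usd: float   # dollar cost of the run
    quality: float    # e.g. pass rate or judge score in [0, 1]

def best_model(records, task, max_cost_usd):
    """Pick the highest-quality model for a task within a cost budget."""
    candidates = [r for r in records
                  if r.task == task and r.cost_usd <= max_cost_usd]
    return max(candidates, key=lambda r: r.quality, default=None)

# Made-up example data
records = [
    EvalRecord("model-a", "refactor", 40.0, 12_000, 0.30, 0.82),
    EvalRecord("model-b", "refactor", 15.0, 5_000, 0.05, 0.71),
]
print(best_model(records, "refactor", max_cost_usd=0.10).model)  # model-b
```

The point of the cross-product is exactly this kind of query: given a task and a budget, which model gives the best quality per dollar?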

But that is a huge and expensive project. The only "approximation" I could pull off reasonably to get this started was to use benchmark scores as a "surrogate" for that.

But I'm working on a way to get this going. If you have additional thoughts on how to approach this, it would be super valuable.