wonderwhyer 13 hours ago

I agree! My "dream" way to do it is closer to how the Aider Leaderboard works, but even a bit better: a GDPEval-like set of tasks, with data across all tasks and all models on how much time, tokens, money, and quality you get from a particular model on a particular task. I was thinking of running evals against skills in that sense.
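To make that concrete, here is a rough sketch of the kind of table I have in mind: one record per (task, model) run, plus a lookup for the cheapest model that clears a quality bar. Everything here is hypothetical, including the `RunResult` fields, the `cheapest_good_enough` helper, the task and model names, and the numbers, which are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunResult:
    """One model's outcome on one task (hypothetical schema)."""
    task: str
    model: str
    seconds: float
    tokens: int
    cost_usd: float
    quality: float  # 0.0 to 1.0, however the grader defines it


def cheapest_good_enough(
    results: Iterable[RunResult], task: str, min_quality: float
) -> Optional[RunResult]:
    """Return the lowest-cost run on `task` with quality >= min_quality, or None."""
    candidates = [
        r for r in results if r.task == task and r.quality >= min_quality
    ]
    return min(candidates, key=lambda r: r.cost_usd, default=None)


# Made-up numbers, only to show the shape of the data.
results = [
    RunResult("refactor-module", "model-a", 42.0, 18000, 0.30, 0.90),
    RunResult("refactor-module", "model-b", 15.0, 9000, 0.05, 0.70),
    RunResult("write-report", "model-a", 60.0, 25000, 0.45, 0.85),
    RunResult("write-report", "model-b", 20.0, 11000, 0.06, 0.80),
]

best = cheapest_good_enough(results, "refactor-module", min_quality=0.8)
print(best.model if best else "no model meets the bar")
```

The same table could just as easily be queried by time or tokens instead of cost; the point is only that the choice is made per task rather than from one global score.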

But that is a huge and expensive project. The only "approximation" I could reasonably pull off to get this started was to use benchmark scores as a "surrogate" for that.

But I'm working on a way to get this going. If you have additional thoughts on how to approach this, it would be super valuable.