siliconc0w 7 hours ago

Shameless self-plug, but I was also worried about silent quality regressions, so I started building a tool to track coding agent performance over time: https://github.com/s1liconcow/repogauge

Here is a sample report that tries out the cheaper models plus the newest Kimi2.6 model against the 5.4 'gold' test cases from the repo: https://repogauge.org/sample_report

conception 6 hours ago | parent [-]

This is cool - just wanted to note https://marginlab.ai is one that has been around for a while.

aleksiy123 6 hours ago | parent [-]

Does anyone know of tools that collect this kind of telemetry while you're using the tools, rather than through offline evals?

Running evals seems like it may be a bit too expensive for a solo dev.