dakolli an hour ago

The only real-world task benchmark I know of is Scale Labs' RLI:

https://labs.scale.com/leaderboard/rli

It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.