yoan9224 | 7 hours ago
The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

What's interesting is the 50% vs 80% reliability gap. At a 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time spent debugging why it failed.

This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.

The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?
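(A minimal sketch of what such a checkpoint loop could look like. The agent.work() and agent.result() calls are assumed placeholder interfaces, not any real framework's API; the point is only the structure: bounded work slices with a human decision after each one.)

    CHECKPOINT_MINUTES = 30

    def human_review(checkpoint_summary: str) -> str:
        # Show the last interval's work and ask the human what to do next.
        print(checkpoint_summary)
        answer = input("continue / redirect / abort? ").strip().lower()
        return answer if answer in ("continue", "redirect", "abort") else "abort"

    def run_with_checkpoints(agent, task: str, max_hours: float = 4.0):
        elapsed = 0
        while elapsed < max_hours * 60:
            # Hypothetical API: let the agent work for one bounded slice.
            summary = agent.work(task, minutes=CHECKPOINT_MINUTES)
            elapsed += CHECKPOINT_MINUTES
            decision = human_review(summary)
            if decision == "abort":
                return None                          # cut losses early, not at hour 4
            if decision == "redirect":
                task = input("revised instructions: ")  # human corrects drift
        return agent.result()                         # hypothetical API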
dvfjsdhgfv | 7 hours ago
> This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.

The problem with this approach is that in 30 minutes an agent can produce a massive amount of stuff. Reviewing all of it is a nightmare, in the sense that on the surface it seems fine and it often works, until it doesn't. The bugs introduced are often subtle and their effects manifest later, if ever.

So, for stuff that matters (to me), I prefer not to use agents at all. Maybe things will change in a year, or 5, or 10; I'll keep giving it a try. But for the moment it's just not worth it, and the upside-down workflow it pushes on me makes me tired and drains the satisfaction from doing my job.