SatvikBeri 5 hours ago
We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=... Models have improved dramatically even with the same harness.