I see the value in that, but there are a few reasons it isn't on the immediate roadmap -- mainly, it shifts focus from measuring the model to measuring the harness. The agentic benchmark section you see on the site is comparable to how an agent would perform using an open harness like Pi. But the latest tool-using models are pretty well adapted to any harness, so I think the harness is less of a factor in overall model performance.

wahnfrieden a day ago | parent [-]

This is just fresh on my mind after reading this from a Codex team member re: the performance difference between Pi and Codex app server usage: https://x.com/pashmerepat/status/2046865863979172039

ZeroGravitas a day ago | parent [-]

Well, that couldn't be vaguer if he tried. He's basically saying "our stuff is better", no reasons given.

wahnfrieden 14 hours ago | parent [-]

Yeah, that's why I'm advocating for measuring it in this thread. Some of these models are trained specifically for their official harnesses.