| ▲ | HereBePandas 11 hours ago |
| Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to have been run on a standard eval harness. It would be interesting to see the apples-to-apples figure, i.e. Gemini 3 Pro with Google's best harness alongside Codex CLI. |
|
| ▲ | Palmik 10 hours ago | parent | next [-] |
| All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI. What do you mean by "standard eval harness"? |
| |
| ▲ | lucassz 2 hours ago | parent [-] |
| I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI. |
|
|
| ▲ | enraged_camel 11 hours ago | parent | prev [-] |
| Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)? |
| |
| ▲ | HereBePandas 11 hours ago | parent [-] |
| Yes, two things:
1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1.
2. More importantly, GPT-5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for GPT-5.1 Codex. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples. Will be interesting to see what Google releases that's coding-specific to follow Gemini 3. |
| ▲ | embedding-shape 7 hours ago | parent [-] |
| > But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples.
That'd be a bad idea. Models are often trained for specific tools (GPT Codex is trained for Codex, and Sonnet has been trained with Claude Code in mind), and vice versa: the tools are built with a specific model in mind, as they all work differently. Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage but instead arbitrarily measure how well a model works with the "standard harness". And if people start caring about that number, it will start to be gamed. |
|
|