Palmik 9 hours ago
All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI. What do you mean by "standard eval harness"?
lucassz 44 minutes ago
I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI.