| ▲ | ethanpil 4 hours ago | |
The table comparing eval scores shows the following for Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 at 74.6%, GPT 5.5 at 78.2%. Then, when you scroll all the way down to the Footnotes section at the bottom, it says: "Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%." | ||
| ▲ | fastball 3 hours ago | parent [-] | |
Seems reasonable? Presumably Claude also performs better under the Claude Code harness. | ||