ethanpil 4 hours ago

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%

Then, when you scroll all the way down to the Footnotes section at the bottom, it says:

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

fastball 3 hours ago

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.