Remix clone Hacker News

	▲	ethanpil 4 hours ago
		The table comparing eval scores shows the following: Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%. Then, when you scroll all the way down to the Footnotes section at the bottom, it says: "Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
	▲	fastball 3 hours ago
		Seems reasonable? Presumably Claude also performs better under the Claude Code harness.