Palmik 9 hours ago

All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results were produced with Gemini CLI.

What do you mean by "standard eval harness"?

lucassz 44 minutes ago

I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI.