Tiberium 7 hours ago
I highly doubt some of those results. GPT 5.2 and 5.2 Codex are incredible for cybersecurity and CTFs, and 5.3 Codex (not on the API yet) even more so. There is absolutely no way they're below DeepSeek or Haiku. Seems like a harness issue, or did they test those models at none/low reasoning effort?
jakozaur 7 hours ago
As someone who builds eval and training data sets for a living: in niche skills you can find plenty of surprises. The code is open source; you can run it yourself using the Harbor Framework:

    git clone git@github.com:QuesmaOrg/BinaryAudit.git
    export OPENROUTER_API_KEY=...
    harbor run --path tasks --task-name lighttpd-* --agent terminus-2 \
      --model openrouter/anthropic/claude-opus-4.6 \
      --model openrouter/google/gemini-3-pro-preview \
      --model openrouter/openai/gpt-5.2 \
      --n-attempts 3

Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at the trajectories.
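Side note: --n-attempts 3 runs each task three times, so the natural way to compare models across attempts is a pass@k estimate rather than a single run. Here is a minimal sketch of the standard unbiased estimator from the HumanEval paper; whether Harbor aggregates its results this way is my assumption, not something I checked:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., HumanEval paper).
        n = total attempts, c = attempts that solved the task, k <= n."""
        if n - c < k:
            return 1.0  # every size-k sample must contain a success
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 3 attempts, 1 success -> estimated pass@1 of ~0.33
    print(pass_at_k(3, 1, 1))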
stared 2 hours ago
To be honest, it was a surprise to us as well. I mean, I used GPT 5.2 Codex in Cursor for decompiling an old game and it worked (way better than Claude Code with Opus 4.5). We tested Opus 4.6, but are waiting for the public API to test GPT 5.3 Codex. At the same time, tasks differ, and not everything that works best end-to-end is the same as what works well in a typical, interactive workflow. We used the Terminus 2 agent because it is the default in Harbor (https://harborframework.com/) and we wanted to stay unbiased. Very likely other harnesses would change the results.
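Since availability on OpenRouter lags vendor releases (as with GPT 5.3 Codex above), a cheap sanity check before a long eval run is to hit OpenRouter's OpenAI-compatible endpoint and confirm the model ID is actually live. A minimal sketch; the model ID is just the one from the command upthread, swap in whatever you want to test:

    import os
    import requests

    # One tiny completion against OpenRouter's documented
    # OpenAI-compatible chat endpoint. An unknown or not-yet-live
    # model ID fails here instead of hours into a benchmark run.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "openai/gpt-5.2",
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 8,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])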