jakozaur 7 hours ago
I do eval and training data sets for a living, and in niche skills you can find plenty of surprises.

The code is open-source; you can run it yourself using the Harbor Framework:

  git clone git@github.com:QuesmaOrg/BinaryAudit.git
  export OPENROUTER_API_KEY=...
  harbor run --path tasks --task-name lighttpd-* \
    --agent terminus-2 \
    --model openrouter/anthropic/claude-opus-4.6 \
    --model openrouter/google/gemini-3-pro-preview \
    --model openrouter/openai/gpt-5.2 \
    --n-attempts 3

Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at trajectories.
Tiberium 6 hours ago
Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate on your leaderboard with GPT models) using --agent codex with gpt-5.2-codex instead of terminus-2, and it identified the backdoor successfully on the first try. I honestly think it's a harness issue. Could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?
Tiberium 6 hours ago
Are the existing trajectories from your runs published anywhere? Or is re-running them myself the only way to see them?