languid-photic 5 hours ago

Naively tested a set of agents on this task.

Each ran the same spec headlessly in their native harness (one shot).

Results:

    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
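
For anyone who wants to compare the rows more directly, here is a minimal sketch that ranks the agents and computes each one's cycle reduction relative to the worst result in the table. The numbers are copied from the table above; the function names are made up for illustration, and treating 147,734 as the unoptimized starting point is only an assumption based on four agents landing on that identical count.

```python
# Results copied from the table above: agent -> (cycles, wall-clock minutes).
RESULTS = {
    "gpt-5-2": (2124, 16),
    "claude-opus-4-5-20251101": (4973, 62),
    "gpt-5-1-codex-max-xhigh": (5402, 34),
    "gpt-5-codex": (5486, 7),
    "gpt-5-1-codex": (12453, 8),
    "gpt-5-2-codex": (12905, 6),
    "gpt-5-1-codex-mini": (17480, 7),
    "claude-sonnet-4-5-20250929": (21054, 10),
    "claude-haiku-4-5-20251001": (147734, 9),
    "gemini-3-pro-preview": (147734, 3),
    "gpt-5-2-codex-xhigh": (147734, 25),
    "gpt-5-2-xhigh": (147734, 34),
}


def rank_by_cycles(results):
    """Return agent names sorted from fewest to most cycles."""
    return sorted(results, key=lambda agent: results[agent][0])


def speedup_vs_worst(results):
    """Ratio of the worst cycle count in the table to each agent's count.

    Assumption: the worst count stands in for the unoptimized starting point.
    """
    worst = max(cycles for cycles, _ in results.values())
    return {agent: worst / cycles for agent, (cycles, _) in results.items()}


def stalled_agents(results):
    """Agents whose cycle count equals the worst count (no measured gain)."""
    worst = max(cycles for cycles, _ in results.values())
    return [agent for agent, (cycles, _) in results.items() if cycles == worst]


if __name__ == "__main__":
    speedups = speedup_vs_worst(RESULTS)
    for agent in rank_by_cycles(RESULTS):
        cycles, minutes = RESULTS[agent]
        print(f"{agent:<28} {cycles:>8,} {minutes:>4}m {speedups[agent]:>6.1f}x")
```

Since each agent ran once, these ratios describe single runs and say nothing about run-to-run variance.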
lawrencechen 4 hours ago | parent | next [-]

codex cli + gpt-5-2-codex-xhigh got to 1,606 cycles with the prompt "beat 1487 cycles. go." in ~53 minutes.

dudewhocodes 13 minutes ago | parent | next [-]

Serious prompt engineering right here

jstummbillig 4 hours ago | parent | prev | next [-]

Will you look at this man's prompting skills?!

raphaelj an hour ago | parent | prev | next [-]

Could you try some open-weight models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?

ponyous 4 hours ago | parent | prev | next [-]

Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.

a24j 3 hours ago | parent | prev | next [-]

Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.

forgotpwd16 4 hours ago | parent | prev | next [-]

Could you make a repo with solutions given by each model inside a dir/branch for comparison?

kitrak95 4 hours ago | parent [-]

Are you giving instructions to a stranger on the internet?

forgotpwd16 3 hours ago | parent | next [-]

Instructions?! Just asked since GP already did it. No need to realize top comment's "DDOS attack on other AI companies" joke.

edf13 3 hours ago | parent | prev [-]

I think he’s asking rather than giving instructions

pelagicAustral 3 hours ago | parent [-]

He's prompting

giancarlostoro 4 hours ago | parent | prev [-]

I do wonder how Grok would compare, specifically their Grok Code Fast model.