postalcoder 3 hours ago

First thoughts using gpt-5.3-codex-spark in Codex CLI:

Blazing fast but it definitely has a small model feel.

It's tearing up bluey bench (my personal agent speed benchmark), a file system benchmark where I have the agent generate transcripts for untitled episodes of a season of bluey. The agent then performs a web search to find the episode descriptions and matches the transcripts against the descriptions to generate file names and metadata for each episode.
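The matching step described above can be sketched roughly like this: pair each generated transcript with the episode description it most resembles. This is a minimal hypothetical illustration using word-overlap (Jaccard) similarity; the actual benchmark has the agent do this matching itself, and all function names and sample data here are made up.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def match_episodes(transcripts: dict[str, str],
                   descriptions: dict[str, str]) -> dict[str, str]:
    """Map each untitled file to the best-matching episode title."""
    return {
        fname: max(descriptions,
                   key=lambda title: similarity(text, descriptions[title]))
        for fname, text in transcripts.items()
    }

# Hypothetical sample data standing in for generated transcripts
# and web-searched episode descriptions.
transcripts = {
    "episode_01.mkv": "Bingo and Bluey play keepy uppy with a balloon",
    "episode_02.mkv": "Dad pretends the kids are at a fancy restaurant",
}
descriptions = {
    "Keepy Uppy": "Bluey, Bingo and Dad play a game of keepy uppy "
                  "with their last party balloon",
    "Fancy Restaurant": "Bluey and Bingo run a pretend restaurant "
                        "for Mum and Dad",
}
print(match_episodes(transcripts, descriptions))
```

A real agent run would use fuzzier matching than word overlap, but the shape of the task (N unlabeled files, N scraped descriptions, best-match assignment) is the same.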

Downsides:

- It has to be prompted to follow instructions in my media library's AGENTS.md that the larger models adhere to without additional prompting.

- It's less careful with how it handles context, which means its actions are less context efficient. Combined with the smaller context window, that means I'm seeing frequent compactions.

  Bluey Bench* (minus transcription time):

  Codex CLI
  gpt-5.3-codex-spark low        20s
  gpt-5.3-codex-spark medium     41s
  gpt-5.3-codex-spark xhigh   1m 09s (1 compaction)

  gpt-5.3-codex low           1m 04s
  gpt-5.3-codex medium        1m 50s

  gpt-5.2 low                 3m 04s
  gpt-5.2 medium              5m 20s

  Claude Code
  opus-4.6 (no thinking)      1m 04s

  Antigravity
  gemini-3-flash              1m 40s
  gemini-3-pro low            3m 39s

  *Season 2, 52 episodes
alexdobrenko an hour ago | parent | next [-]

can we please make the bluey bench the gold standard for all models always

mnicky 2 hours ago | parent | prev | next [-]

Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.

postalcoder 2 hours ago | parent [-]

Added a thinking-disabled Opus 4.6 timing. It took 1m 4s – coincidentally the same as 5.3-codex-low.

Squarex 2 hours ago | parent | prev [-]

I wonder why they named it so similarly to the normal codex model when it's much worse, though still cool, of course.