jumploops 3 hours ago

If you used GPT-5.5 over the last 24 hours or so, you may have already had access to 5.6.

I've been running some tests on a harness we're building, and suddenly saw a jump of a few points yesterday. I reran the benchmark on vanilla Codex and saw an ~88% score on Terminal Bench 2.1 from GPT-5.5.

The biggest indicator, beyond the score, was that 3 tests which frequently hit "safety" blockers with 5.5 started succeeding last night without warning.

2 hours ago | parent | next [-]
[deleted]
hhh an hour ago | parent | prev | next [-]

These things can change simply because of infrastructure changes, rather than some mysterious A/B test.

dakolli an hour ago | parent | prev | next [-]

Did you even read the release? It wasn't broadly released to anyone.

"At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly."

This comment is an excellent example of why the average LLM user is basically a slot machine user who thinks "this one is hot, this one is lucky, this one is better than the others" and constantly switches between models on a whim, based on some occult understanding that only they possess.

Also, who cares about some 80% benchmark? They train on these public benchmarks in order to impress people like yourself who ascribe meaning to them. How come they only get a 4% pass rate on $20-30/hr Upwork tasks? It seems to me like these benchmarks are basically useless. There's a thing called variance; I'm not sure why higher scores on a few tests would lead you to believe you have access to a model that they say you don't have access to.

https://labs.scale.com/leaderboard/rli
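
To put a rough number on the variance point, here is a minimal sketch that computes a Wilson score interval for a benchmark pass rate. The task count (100) and the pass counts (85 and 88) are hypothetical placeholders, not figures from Terminal Bench or from this thread, and treating each task as an independent pass/fail trial is itself a simplifying assumption.

```python
import math


def pass_rate_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (roughly 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical numbers: 100 tasks, 85 passes on one run and 88 on the next.
low_a, high_a = pass_rate_interval(85, 100)
low_b, high_b = pass_rate_interval(88, 100)

print(f"run A: {low_a:.3f} to {high_a:.3f}")
print(f"run B: {low_b:.3f} to {high_b:.3f}")
print("intervals overlap:", low_b <= high_a and low_a <= high_b)
```

If the two intervals overlap heavily, a jump of a few points between runs is hard to distinguish from noise without more repetitions.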

jumploops 12 minutes ago | parent | prev [-]

[dead]