| ▲ | jumploops 3 hours ago | |
If you used GPT-5.5 over the last 24 hours or so, you may have already had access to 5.6. I've been running some tests on a harness we're building, and suddenly saw a jump of a few points yesterday. I reran the benchmark on vanilla Codex and saw an ~88% score on Terminal Bench 2.1 from GPT-5.5. The biggest indicator, beyond the score, was that 3 tests which frequently hit "safety" blockers with 5.5 started succeeding last night without warning. | ||
| ▲ | hhh an hour ago | parent | prev | next [-] | |
These scores can shift because of ordinary infrastructure changes rather than some mysterious A/B testing. | ||
| ▲ | dakolli an hour ago | parent | prev | next [-] | |
Did you even read the release? It wasn't broadly released to anyone: "At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly." This comment is an excellent example of why the average LLM user is basically a slot machine user who thinks "this one is hot, this one is lucky, this one is better than the others" and constantly switches between models on a whim, based on some occult understanding that only they possess. Also, who cares about some 80% benchmark? They train on these public benchmarks in order to impress people like yourself who ascribe meaning to them. How come they only pass 4% of $20-30/hr Upwork tasks? It seems to me like these benchmarks are basically useless. There's a thing called variance. I'm not sure why higher scores on a few tests would lead you to believe you have access to a model that they say you don't have access to. | ||
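To put a rough number on the variance point, here is a minimal sketch that treats each benchmark task as an independent pass/fail trial and asks how many standard errors separate two pass rates. The task count and the "before" score are hypothetical placeholders, since the thread only gives the ~88% figure and "a few points". The independence assumption is also a simplification, because both runs use the same tasks.

```python
import math


def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def jump_in_std_errors(before: float, after: float, n_tasks: int) -> float:
    """Difference between two pass rates, in units of its standard error.

    Assumes the two runs are independent, which is only an approximation
    when both runs cover the same task set.
    """
    se_before = pass_rate_std_error(before, n_tasks)
    se_after = pass_rate_std_error(after, n_tasks)
    se_diff = math.sqrt(se_before ** 2 + se_after ** 2)
    return (after - before) / se_diff


N_TASKS = 100   # hypothetical task count, not taken from the thread
BEFORE = 0.84   # hypothetical earlier score ("a few points" lower)
AFTER = 0.88    # the ~88% score reported in the thread

z = jump_in_std_errors(BEFORE, AFTER, N_TASKS)
print(f"jump of {AFTER - BEFORE:.0%} is about {z:.2f} standard errors")
```

Under these made-up inputs the jump is under one standard error, which is the kind of difference a single rerun can produce on its own. Repeating the run several times on each side would give a better estimate than this one-shot approximation.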
| ▲ | jumploops 12 minutes ago | parent | prev [-] | |
[dead] | ||