Remix clone Hacker News


	▲	jumploops 3 hours ago
		If you used GPT-5.5 over the last 24 hours or so, you may have already had access to 5.6. I've been running some tests on a harness we're building, and yesterday I suddenly saw a jump of a few points. I reran the benchmark on vanilla Codex and saw an ~88% score on Terminal Bench 2.1 from GPT-5.5. The biggest indicator, beyond the score, was that 3 tests which frequently hit "safety" blockers with 5.5 started succeeding last night without warning.
	▲	2 hours ago
		[deleted]
	▲	hhh an hour ago
		These things can shift because of infrastructure changes alone, rather than some mysterious A/B test.
	▲	dakolli an hour ago
		Did you even read the release? It wasn't broadly released to anyone: "At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly." This comment is an excellent example of why the average LLM user is basically a slot machine user who thinks "this one is hot, this one is lucky, this one is better than the others" and constantly switches between models on a whim, based on some occult understanding that only they possess. Also, who cares about some 80% benchmark? They train on these public benchmarks in order to impress people like yourself who ascribe meaning to them. How come they only get a 4% pass rate on $20-30/hr Upwork tasks? It seems to me like these benchmarks are basically useless. There's a thing called variance; I'm not sure why higher scores on a few tests would lead you to believe you have access to a model that they say you don't have access to. https://labs.scale.com/leaderboard/rli
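		To put a rough number on the variance point, here is a minimal sketch of the sampling noise on a single benchmark run. It is only an illustration: the task count (89) and the pass count (78) are assumed, hypothetical values and are not taken from the thread, `pass_rate_interval` is a made-up helper name, and the normal-approximation interval treats every task as an independent pass/fail draw, which real benchmark tasks may not be.

```python
import math


def pass_rate_interval(passed, total, z=1.96):
    """Normal-approximation (Wald) interval for a benchmark pass rate.

    Treats each task as an independent pass/fail trial, which is a
    simplifying assumption.
    """
    p = passed / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of the pass rate
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical numbers: 78 passes out of 89 tasks (both are assumptions).
rate, low, high = pass_rate_interval(78, 89)
print(f"pass rate {rate:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

		Under these assumed numbers the interval spans roughly 7 percentage points on either side of the measured score, so a jump of a few points between two single runs is within what sampling noise alone could produce.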
	▲	jumploops 12 minutes ago
		[dead]