Remix clone Hacker News

	▲	karmasimida 4 hours ago
		For those who care: GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld). Does anyone know the difference between OSWorld and OSWorld Verified?
	▲	nopinsight 3 hours ago
		From Claude 4.6 Thinking: OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong. Scores on Verified tend to run higher, so they're not directly comparable.