Remix clone Hacker News

	▲	irthomasthomas 2 hours ago
		Anthropic has again changed the set of benchmarks they use[0]. This time they have also moved all benchmark scores to the PDF. At a glance it looks like it gains about 5-10% over other models. The speed is about the same as Opus >=4.5 and Sonnet 4.5, and double the speed of Opus <=4.1.

		Benchmark                     Mythos 5  Fable 5  MythosPrev  Opus 4.8  GPT-5.5  Gemini 3.1 Pro
		SWE-bench Pro                 80.3      80       77.8        69.2      58.6     54.2
		SWE-bench Verified            95.5      95       93.9        88.6      -        80.6
		Terminal-Bench                88.0      84.3     -           82.7      83.4     -
		BrowseComp (Single-Agent)     88.0      -        87.9        84.3      84.4     85.9
		BrowseComp (Multi-Agent)      93.3      -        -           88.5      -        -
		HLE (No tools)                59.0      -        56.8        49.8      41.4     44.4
		HLE (Tools)                   64.5      -        64.7        57.9      52.2     51.4
		CharXiv Reasoning (No tools)  88.9      -        86.2        80.5      -        -
		CharXiv Reasoning (Tools)     93.5      -        92.5        89.9      -        -
		BioMystery Bench (Human)      83.9      -        82.6        80.4      -        -
		BioMystery Bench (Hard)       46.1      -        29.6        40.0      -        -
		OSWorld-Verified              85.0      85.0     85.4        83.4      78.7     76.2*
		CritPt                        28.6      -        20.9        27.1      17.7     -
		ArxivMath                     78.5      68.7     71.8        71.5      64.0     -

		[0] https://news.ycombinator.com/item?id=48312633

		Edit: Also in the system card... "we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). ... Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."
	▲	charles_f an hour ago
		It's announced as a revolution, but when you look at those benchmarks it surely looks like an iteration.