k2xl 8 hours ago
Surprised to see only a slight improvement on SWE-Bench Pro (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores on this - and also what makes this test particularly hard to do well on compared to Terminal Bench (where 5.5 seemed to have a big jump).
vexna 7 hours ago
There's an asterisk right below that table stating:

> *Anthropic reported signs of memorization on a subset of problems

And Anthropic's Opus 4.7 release page also states:

> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7's margin of improvement over Opus 4.6 holds.
conradkay 7 hours ago
Was 4.7 distilled off Mythos (which got 77.8%)? Interesting how Mythos got 82% on Terminal-Bench 2.0 compared to 82.7% for GPT-5.5. Also note how, for SWE-Bench Pro alone, they state:

> *Anthropic reported signs of memorization on a subset of problems