Remix clone Hacker News

	▲	lordmauve 19 hours ago
		I don't know if DeepSWE is genuinely a good benchmark. What matters more is that their analysis objectively demolished the validity of SWE-Bench Pro: it is being mismarked. I think that buys them enough credibility to propose an alternative. I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.