Remix clone Hacker News

	yaodub 2 hours ago
		SWE-bench measures single tasks in isolation. In a real loop, the model usually loses track of what I was trying to do long before code quality becomes the issue.