Remix clone Hacker News

	▲	smokel 5 hours ago
		I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.
	▲	blibble 3 hours ago
		it ceases to be a useful benchmark of general ability when you post it publicly for them to train against