Remix clone Hacker News


	▲	prmph 8 hours ago
		So many things to think about regarding these "benchmarks":
		- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
		- Would Anthropic (or any other model vendor, for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
		- Would it be more useful to move toward a comparative rather than an absolute ranking?