

	▲	siva7 6 hours ago
		Could it really be that we not only vibeslop all our apps nowadays, but also don't even bother to check how the AI solved a benchmark it claimed to have solved?
	▲	SpicyLemonZest 5 hours ago
		Frontier model developers try to check for memorization. But until AI interpretability is a fully solved problem, how can you really know whether the model genuinely didn't memorize the answers or your memorization check was simply wrong?
	▲	retinaros 3 hours ago
		Every AI lab trains on the test set. That is a big part of why we see benchmark scores climb from 1% to 30% after a few model iterations.
	▲	operatingthetan 6 hours ago
		A more interesting benchmark would probably be one that scores the LLM on finding exploits in the benchmark itself.