vlovich123 4 days ago

One classic problem in all ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark.

This remains an open problem for LLMs - we don't have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it delivered huge gains across the board that don't look like benchmark overfitting.
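For a sense of how this gets checked: one coarse decontamination heuristic is to flag benchmark items whose n-grams appear verbatim in the training data. A minimal sketch, assuming both corpora fit in memory as plain strings (the n and threshold values here are arbitrary, not any lab's actual pipeline):

    # Coarse contamination check: flag benchmark items whose n-grams
    # appear verbatim in the training corpus. A real pipeline would
    # normalize text and scale beyond an in-memory set; this is only
    # an illustrative sketch.
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contaminated(benchmark_items, training_docs, n=8, threshold=0.5):
        train_grams = set()
        for doc in training_docs:
            train_grams |= ngrams(doc, n)
        flagged = []
        for item in benchmark_items:
            grams = ngrams(item, n)
            if grams and len(grams & train_grams) / len(grams) >= threshold:
                flagged.append(item)
        return flagged

Passing such a check is necessary but not sufficient - paraphrased or translated benchmark items slip right past verbatim n-gram matching.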

ipaddr 4 days ago | parent [-]

This could be a solved problem. Come up with problems that aren't online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult.
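A minimal sketch of that triage step using the OpenAI Python client (the model name and prompt are placeholders, and note that sending the problems to a hosted model is itself a form of leakage, as the reply below points out):

    # Sketch: have an LLM bucket privately held problems by difficulty.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_difficulty(problem: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{
                "role": "user",
                "content": "Rate this problem's difficulty as exactly one of "
                           "EASY, MEDIUM, or HARD:\n\n" + problem,
            }],
        )
        return resp.choices[0].message.content.strip()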

vlovich123 4 days ago | parent | next [-]

Hard to do for an industry benchmark, since running the test requires sending the questions to the LLM, which then effectively puts them into a public training set.

This has been tried multiple times by multiple people, and over time such benchmarks tend to lose their immunity to “cheating”.

kalkin 4 days ago | parent | prev [-]

How do you imagine existing benchmarks were created?