Remix clone Hacker News

	▲	xienze 7 hours ago
		> Benchmarks suggests they are comparable

		The problem here is people think AI benchmarks are analogous to, say, CPU performance benchmarks. They're not:

		* You can't control all the variables, only one (the prompt).
		* The outputs, BY DESIGN, can fluctuate wildly for no apparent reason (e.g., utter failure on the first run, success on the second).
		* The biggest point: once a benchmark is known, future iterations of the model will be trained on it.

		Trying to objectively measure model performance is a fool's errand.