Remix clone Hacker News

	yorwba 4 hours ago
		Why 45 times in particular? If you want 80% power to distinguish a model at 50% from a model at 51%, you need 39,440 samples per model, or 329 samples per question per model. But that would just give you a more precise estimate of how well the model does on those 120 questions in particular. If you want a more precise estimate of how well the model might do on future questions you come up with, you'll need to test more questions, not just test the same question more times.
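For anyone who wants to check the arithmetic, here is a rough sketch of where a number like that can come from. It assumes a two-sided two-proportion z-test at a 5% significance level with a continuity correction; the comment does not say which method was used, so the formula, the alpha value, and the helper name `samples_per_group` are my assumptions rather than the commenter's actual calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p1, p2, alpha=0.05, power=0.80, continuity=True):
    """Approximate per-group sample size for a two-sided two-proportion z-test.

    Hypothetical helper: alpha=0.05 and the continuity correction are
    assumptions, not something stated in the comment above.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    delta = abs(p1 - p2)
    p_bar = (p1 + p2) / 2                          # pooled proportion under H0

    # Classic normal-approximation sample size (no continuity correction).
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2

    if continuity:
        # Continuity-corrected version; slightly larger than the plain formula.
        n = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return n


n = samples_per_group(0.50, 0.51)
questions = 120  # number of benchmark questions mentioned in the comment
print(round(n), ceil(n / questions))
```

Under those assumptions the result lands very close to the 39,440 per model quoted above, and dividing by 120 questions rounds up to 329 runs per question. Dropping the continuity correction gives a somewhat smaller figure, so the exact number depends on which variant of the formula you pick. Note that this only covers the sampling noise from rerunning the same 120 questions; it says nothing about how the model would do on new questions.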