Remix clone Hacker News

	▲	swyx 8 hours ago
		*50 unique problems, but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N). Simple answer is our reporting was pass@5. I feel like you'd need 50+ runs to have reasonable confidence intervals, which somehow I don't see other people do, so I also didn't insist on it. Hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
	▲	tedsanders 7 hours ago
		Makes sense, thanks. I suppose error bars are tricky if trying to handle problem-to-problem variance, rubric-to-rubric variance, and run-to-run variance all at once.
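
For readers who want to see what error bars on a pass@k number could look like, here is a minimal sketch, not the method used by anyone in this thread. It assumes each run of each problem has already been reduced to a single pass/fail (so rubric-to-rubric variance is not modeled), uses the standard unbiased pass@k estimator, and gets an interval from a two-level bootstrap: resample problems, then resample runs within each problem. The names `pass_at_k`, `bootstrap_ci`, and the `results` layout are hypothetical.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c passed (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of runs n")
    # Probability that a random size-k subset of the n runs contains no pass.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(results, k, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean pass@k.

    results: one list per problem, each holding 0/1 run outcomes
    (an assumed layout, chosen for illustration).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Level 1: resample problems with replacement (problem-to-problem variance).
        problems = rng.choices(results, k=len(results))
        scores = []
        for runs in problems:
            # Level 2: resample runs within the problem (run-to-run variance).
            resampled = rng.choices(runs, k=len(runs))
            scores.append(pass_at_k(len(resampled), sum(resampled), k))
        stats.append(sum(scores) / len(scores))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

One thing the sketch makes visible: with exactly 5 runs per problem, pass@5 for a single problem is either 0 or 1, so almost all of the interval width comes from resampling problems. Separating out run-to-run variance needs more runs per problem than k, which is consistent with the 50+ runs suggested above.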