airstrike 5 hours ago

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

pants2 3 hours ago | parent

Generally, the easiest approach:

1. Sample a set of prompts / answers from historical usage.

2. Run those prompts through a few frontier models and, where they don't agree on an answer, hand-pick the one you're looking for (rough sketch after the list).

3. Test different models via OpenRouter and score each along cost / speed / accuracy against your test set (scoring sketch below as well).

4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
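For steps 1-2, a rough sketch of what that can look like in Python, assuming OpenRouter's OpenAI-compatible chat-completions endpoint and a prompts.jsonl sampled from your logs. The "referee" model names and file names are just examples, and "agreement" here is naive exact match; in practice you'd normalize answers or use a judge model before accepting one as gold.

    import json
    import requests

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
    HEADERS = {"Authorization": "Bearer YOUR_OPENROUTER_KEY"}  # placeholder key

    # Frontier "referee" models used to draft gold answers (names illustrative).
    REFEREES = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]

    def answer(model, prompt):
        resp = requests.post(
            OPENROUTER_URL,
            headers=HEADERS,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()

    # prompts.jsonl: one {"prompt": "..."} per line, sampled from historical usage.
    agreed, needs_review = [], []
    with open("prompts.jsonl") as f:
        for line in f:
            prompt = json.loads(line)["prompt"]
            answers = [answer(m, prompt) for m in REFEREES]
            if len(set(answers)) == 1:   # referees agree -> accept as gold answer
                agreed.append({"prompt": prompt, "expected": answers[0]})
            else:                        # disagreement -> hand-pick the answer later
                needs_review.append({"prompt": prompt, "candidates": answers})

    with open("testset.json", "w") as f:
        json.dump(agreed, f, indent=2)
    with open("needs_review.json", "w") as f:
        json.dump(needs_review, f, indent=2)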
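And for step 3, a minimal scoring loop over the adjudicated test set, again via OpenRouter. The candidate model names and the substring-match correct() check are placeholders; the cost dimension can be added by reading the usage token counts in each response and multiplying by each model's per-token pricing.

    import json
    import time
    import requests

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
    HEADERS = {"Authorization": "Bearer YOUR_OPENROUTER_KEY"}  # placeholder key

    # Candidate models to benchmark (names illustrative).
    CANDIDATES = [
        "openai/gpt-4o-mini",
        "anthropic/claude-3.5-haiku",
        "meta-llama/llama-3.1-70b-instruct",
    ]

    def ask(model, prompt):
        # Send one prompt to one model; return (answer, latency in seconds).
        start = time.time()
        resp = requests.post(
            OPENROUTER_URL,
            headers=HEADERS,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        return text, time.time() - start

    def correct(answer, expected):
        # Crude substring check; swap in regexes or an LLM judge as needed.
        return expected.strip().lower() in answer.strip().lower()

    test_set = json.load(open("testset.json"))  # built in steps 1-2

    for model in CANDIDATES:
        hits, latency = 0, 0.0
        for case in test_set:
            answer, seconds = ask(model, case["prompt"])
            hits += correct(answer, case["expected"])
            latency += seconds
        print(f"{model:45s} accuracy={hits / len(test_set):.0%} "
              f"avg_latency={latency / len(test_set):.1f}s")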