pamelafox 6 hours ago
So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents. A rough sketch of what that kind of harness looks like is below.
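To be clear, this is just illustrative, not the actual code: run_copilot_prompt is a placeholder for however you invoke the agent (e.g. through the Copilot SDK), and the pass/fail check and model names are made up. The point is just repeating the same prompt N times per model and comparing pass rates.

```python
import statistics

# Placeholder: swap in however you actually invoke the agent
# (e.g. via the Copilot SDK). Hypothetical function, not a real API.
def run_copilot_prompt(prompt: str, model: str) -> str:
    raise NotImplementedError("invoke your agent here")

def passes_checks(output: str) -> bool:
    # Whatever pass/fail criteria make sense for this prompt.
    return "expected snippet" in output

def evaluate(prompt: str, model: str, runs: int = 5) -> float:
    # Re-run the same prompt several times and return the pass rate.
    results = [passes_checks(run_copilot_prompt(prompt, model)) for _ in range(runs)]
    return statistics.mean(results)

if __name__ == "__main__":
    for model in ["model-a", "model-b"]:  # hypothetical model names
        rate = evaluate("Refactor this function to use async/await.", model, runs=5)
        print(f"{model}: {rate:.0%} pass rate over 5 runs")
```

Even without a fixed seed, the per-model pass rates tend to be stable enough across runs that comparing them this way is meaningful.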