imiric 6 hours ago

That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.

pamelafox 6 hours ago

So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents.
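The repeated-run evaluation idea can be sketched roughly like this: run the same prompt several times, score each output, and look at the mean and spread. Everything here is a stand-in, a minimal sketch; `run_prompt` and `score` are hypothetical placeholders, not the GitHub Copilot SDK's actual API.

```python
# Minimal sketch of a repeated-run eval harness.
# `run_prompt` and `score` are hypothetical stand-ins for a real
# model/agent call and a real metric.
from statistics import mean, pstdev

def run_prompt(prompt: str) -> str:
    # Hypothetical: replace with an actual model/agent invocation.
    return "example output"

def score(output: str) -> float:
    # Hypothetical metric: here, whether a keyword appears.
    return float("example" in output)

def evaluate(prompt: str, runs: int = 5) -> tuple[float, float]:
    # Re-run the same prompt and summarize score stability across runs.
    scores = [score(run_prompt(prompt)) for _ in range(runs)]
    return mean(scores), pstdev(scores)

avg, spread = evaluate("Refactor this function to be pure.")
print(f"mean={avg:.2f} stdev={spread:.2f}")
```

A low standard deviation across runs is what lets you say results are "fairly similar" for a given model/prompt/settings combination, even without a fixed seed.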

ChrisGreenHeur 5 hours ago

Same with people: no matter what info you give a person, you can't be sure they'll follow it the same way every time.