Remix clone Hacker News


	▲	pu_pe 4 hours ago
		He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?
	▲	srdjanr 3 hours ago
		It makes sense to me intuitively (though I'm not sure my reasoning is actually correct). A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. A better model might "know" enough to be more confident, and thus more consistent.
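
One rough way to test this intuition would be to score the same candidate many times with each model and compare the spread. The sketch below is only an illustration: `score_spread` and `make_noisy_scorer` are hypothetical helpers, and the noisy scorer is a simulated stand-in for a real model call, not the tool discussed in the thread. The centers and noise widths are assumptions chosen to roughly match the ranges reported above (66 to 99 and 48 to 64).

```python
import random
import statistics


def score_spread(score_fn, runs=20):
    """Call score_fn repeatedly and summarize how much the scores vary."""
    scores = [score_fn() for _ in range(runs)]
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),
    }


def make_noisy_scorer(center, noise, seed):
    """Hypothetical stand-in for a model: a fixed score plus uniform noise.

    A real experiment would replace this with repeated calls to the model
    on the same input. The result is clamped to the 0-100 scale.
    """
    rng = random.Random(seed)

    def score():
        raw = center + rng.uniform(-noise, noise)
        return max(0, min(100, round(raw)))

    return score


# Assumed parameters: a "weaker" scorer with wide noise and a "stronger"
# scorer with narrow noise. These are illustrative, not measured values.
weak = score_spread(make_noisy_scorer(center=80, noise=17, seed=0))
strong = score_spread(make_noisy_scorer(center=56, noise=8, seed=0))

print("weak model:  ", weak)
print("strong model:", strong)
```

If the intuition holds, a real frontier model plugged in for the simulated scorer should show a smaller `range` and `stdev` than the smaller models on the same input.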