pu_pe 4 hours ago
He tried with a tiny model (gemma3:4b) and got scores ranging from 66 to 99. Then he tried again with a small model (gemini 3.1 flash lite), and the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?
srdjanr 3 hours ago
It makes sense to me intuitively (though I'm not sure my reasoning is actually correct). A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. A better model might "know" enough to be more confident, and thus more consistent.
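For what it's worth, one way to put a number on "consistent" is to score the same candidate several times and look at the spread. Below is a rough sketch, not anything from the tool itself: `score_spread` is a made-up helper name, and the inputs are only the range endpoints quoted upthread, since the individual runs weren't reported.

```python
import statistics


def score_spread(scores):
    """Summarize how much repeated scores for the same input vary."""
    lowest, highest = min(scores), max(scores)
    return {
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        # Population standard deviation of the supplied scores.
        "stdev": statistics.pstdev(scores),
    }


# Assumption: only the endpoints of each reported range are known,
# so these two-point lists stand in for the full set of runs.
tiny = score_spread([66, 99])   # gemma3:4b
small = score_spread([48, 64])  # gemini 3.1 flash lite

print(tiny["range"], small["range"])  # 33 16
```

With the real per-run scores in place of the endpoints, the standard deviation would be a fairer comparison than the raw range, since the range only reflects the two most extreme runs.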