faxmeyourcode 33 minutes ago
I had a hunch that Opus 4.7 hedged more than other models, and it turns out it's true.
Datasette query here: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
kostaj 27 minutes ago
This is in line with my observations and tests as well. It's also supported by the distribution of verdicts across the four buckets: Gemini uses the middle buckets (Mostly True and Misleading) much less often, 6% combined for Gemini w/o search, while Opus uses them the most, 45% combined. Looks like Gemini is calibrated to be confident and Opus to be careful.
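For anyone who wants to check the "combined" figure themselves, here is a rough sketch of how the middle-bucket share could be computed from a list of verdict labels. Only "Mostly True" and "Misleading" come from the comment above; the function name `middle_bucket_share`, the outer bucket names "True" and "False", and the toy data are assumptions for illustration, not the real dataset or its schema.

```python
from collections import Counter

# The two middle buckets named in the comment above.
MIDDLE_BUCKETS = frozenset({"Mostly True", "Misleading"})


def middle_bucket_share(verdicts):
    """Return the fraction of verdicts that land in the two middle buckets."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        # No verdicts at all: report 0.0 rather than dividing by zero.
        return 0.0
    middle = sum(counts[label] for label in MIDDLE_BUCKETS)
    return middle / total


# Made-up toy data, NOT the real results. "True" and "False" are
# assumed names for the two outer buckets.
toy = ["True"] * 5 + ["Mostly True"] * 2 + ["Misleading"] * 1 + ["False"] * 2
print(f"{middle_bucket_share(toy):.0%}")
```

Running this once per model on that model's verdict column would give numbers comparable to the 6% and 45% quoted above, assuming the labels in the data match these strings exactly.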