marsh_mellow 6 days ago
I don't think the absolute score means much: judge models have a tendency to score around 7/10 lol. 55% vs. 45% equates to about a 35-point difference in Elo. In chess, that would be two players in the same league, but one with a clear edge.
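For anyone who wants to check the conversion, here is a rough sketch. It assumes the 55% vs. 45% split can be read as an expected score under the standard logistic Elo model (400-point scale, base 10). The function names `elo_gap` and `expected_score` are just illustrative, not from any library.

```python
import math


def elo_gap(expected: float) -> float:
    """Rating gap implied by an expected score, assuming the logistic Elo model."""
    if not 0.0 < expected < 1.0:
        raise ValueError("expected must be strictly between 0 and 1")
    # Invert E = 1 / (1 + 10 ** (-gap / 400)) to solve for gap.
    return 400.0 * math.log10(expected / (1.0 - expected))


def expected_score(gap: float) -> float:
    """Expected score of the stronger side for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    gap = elo_gap(0.55)
    print(f"55% vs. 45% implies a gap of about {gap:.1f} points")
    print(f"round trip: {expected_score(gap):.2f}")
```

Under those assumptions the gap comes out to roughly 35 points, close to the figure quoted above. A 50/50 split maps to a gap of 0, and the relationship is nonlinear, so a 60/40 split is about 70 points rather than double-and-a-bit of nothing.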
kevmo314 6 days ago
Rarely are two models put head-to-head, though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.