InkCanon 8 months ago

>4.1 Was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better?

>Score of 6.81 against 6.66

So very slightly better

wiz21c 8 months ago | parent | next [-]

"they found that GPT‑4.1 excels at both precision..."

They didn't say it is better than Claude at precision etc. Just that it excels.

Unfortunately, AI has still not concluded that manipulation by the marketing dept is a plague...

kevmo314 8 months ago | parent | prev | next [-]

A great way to upsell 2% better! I should start doing that.

neuroelectron 8 months ago | parent [-]

Good marketing if you're selling a discount all purpose cleaner, not so much for an API.

marsh_mellow 8 months ago | parent | prev [-]

I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 36 point difference in Elo. In chess that would be two players in the same league but one with a clear edge
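
For anyone who wants to check the conversion: a minimal sketch, assuming the standard logistic Elo expected-score model and treating the 55% preference rate as an expected score. The function names `elo_gap` and `expected_score` are made up for illustration. Under that assumption the gap comes out a little under 35 points, in the same ballpark as the figure above; other conversion tables can give slightly different numbers.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating gap implied by an expected score, under the logistic Elo model.

    Assumption: E = 1 / (1 + 10 ** (-gap / 400)), solved for gap.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def expected_score(gap: float) -> float:
    """Inverse of elo_gap: expected score for the higher-rated side."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    gap = elo_gap(0.55)
    print(f"55% vs. 45% -> about {gap:.1f} Elo points")
    print(f"round trip -> {expected_score(gap):.2f}")
```

Note this only says how lopsided the head-to-head preference is; it says nothing about how far apart the two reviews are in absolute quality.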

kevmo314 8 months ago | parent [-]

Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.

swyx 8 months ago | parent [-]

the point is oai is saying they have a viable Claude Sonnet competitor now