I think people only care about how Claude or Codex does.

GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., in at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
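A rough sketch of the arithmetic behind those figures, for anyone who wants to check it. The thread does not state the sample size, so `n_claims = 1000` is purely a hypothetical value chosen for illustration, and the normal-approximation (Wald) interval is an assumption about the method, not necessarily what the study used.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 gives ~95%."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

agreement = 0.65   # agreement rate quoted in the comment
n_claims = 1000    # hypothetical sample size, NOT stated in the thread

low, high = wald_interval(agreement, n_claims)

# When two models put a claim in different buckets, at most one of them
# can be right, so the disagreement rate is a lower bound on the share
# of claims where at least one model is wrong.
disagreement = 1 - agreement

print(f"agreement: {agreement:.0%}, 95% CI {low:.1%} to {high:.1%}")
print(f"at least one model wrong on >= {disagreement:.0%} of claims")
```

With a sample of roughly this size the interval comes out close to the quoted 62–68%, which suggests the study scored on the order of a thousand claims, though that is only an inference.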

	TaupeRanger an hour ago
		But that's without internet search. Everyone I know uses the models that search when they need to, and I'm sure GPT and Opus would agree on almost everything if 1) they searched when necessary, and 2) they were allowed to give context to their answers instead of being hamstrung to get specious "research" results.

spprashant 3 hours ago

Looks like they land at the average number of 67% disagreement.

airstrike 3 hours ago

I agree, but the market is pricing in way beyond that.