ipunchghosts 3 hours ago
I think people only care about how Claude or Codex does.
kostaj 2 hours ago
GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., on at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
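For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes disagreement is simply the complement of the agreement rate, so the interval bounds swap places. The function name is made up for illustration and is not from the study.

```python
def disagreement_interval(agree_pct, ci_low_pct, ci_high_pct):
    """Convert an agreement rate and its CI (whole percents) to disagreement.

    Hypothetical helper: assumes disagreement = 100% - agreement.
    """
    # Taking the complement flips the interval: the upper agreement bound
    # becomes the lower disagreement bound, and vice versa.
    return 100 - agree_pct, 100 - ci_high_pct, 100 - ci_low_pct


rate, low, high = disagreement_interval(65, 62, 68)
print(f"disagreement: {rate}% (95% CI {low}-{high}%)")
# prints "disagreement: 35% (95% CI 32-38%)"
```

So 35% is the point estimate for disagreement, with the interval running from 32% to 38%.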
spprashant 3 hours ago
Looks like they land at the average number of 67% disagreement.
airstrike 3 hours ago
I agree, but the market is pricing way beyond that.