modeless 4 hours ago
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so those numbers aren't comparable.
alexhans 4 hours ago
Isn't the best eval the one you build yourself, for your own use cases and the value you actually produce? I encourage people to try it. You can even timebox it: come up with a few simple cases that might initially look insufficient, but that discomfort is actually a sign there's something there. It's very similar to going from having no unit/integration tests for design or regression to having them.
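A minimal sketch of what a timeboxed personal eval could look like, assuming you supply a `run_model` callable that wraps whichever model or CLI you want to compare. The cases, checkers, and names here are hypothetical placeholders, not anyone's real suite:

```python
from typing import Callable

# Each case: a prompt plus a cheap, deterministic check on the output.
# Replace these placeholders with tasks from your own codebase/workflow.
CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Write a Python one-liner that reverses a string s.",
     lambda out: "[::-1]" in out),
    ("What HTTP status code means 'Not Found'?",
     lambda out: "404" in out),
]

def run_eval(run_model: Callable[[str], str]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for prompt, check in CASES:
        output = run_model(prompt)
        if check(output):
            passed += 1
    return passed / len(CASES)

if __name__ == "__main__":
    # Placeholder model: swap in a real API/CLI call for each model
    # you want to compare, then run the identical cases against both.
    fake_model = lambda prompt: "use s[::-1] or return 404"
    print(f"pass rate: {run_eval(fake_model):.0%}")
```

Even a handful of cases like this, run against two models on the same day, gives you a more relevant comparison than mismatched published benchmarks.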
rsanek 4 hours ago
I usually wait to see what ArtificialAnalysis says for a direct comparison.
input_sh 4 hours ago
It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!