| ▲ | rTX5CMRXIfFG an hour ago | |
Am I crazy, or are these differences between the best models so marginal that you'd get roughly the same performance if you used the same high-quality harness (i.e., preloaded instructions from md files, including custom skills)? | ||
| ▲ | bluegatty an hour ago | parent | next [-] | |
You will immediately notice the difference if you use it at the threshold. It's like most people watching a 'starting NBA player' (not a superstar, just a starter) vs. one who sits on the bench. If you were to just watch them play, work out, shoot - you'd never notice the difference. Put them head to head and it's 98-54, and you start to see the patterns. It's pretty interesting actually; someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here. Software has innumerable kinds of problems at varying levels of complexity, so it provides the perfect testbed for seeing how far models can go in practice. Should add: you're very right to hint that the harness, the tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference. By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change. | ||
| ▲ | nl an hour ago | parent | prev | next [-] | |
The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks. I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved. And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious. | ||
| ▲ | minimaxir an hour ago | parent | prev | next [-] | |
To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills. | ||
| ▲ | raincole an hour ago | parent | prev | next [-] | |
By definition, the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others, then it's not one of the best models. | ||
| ▲ | Sparkyte an hour ago | parent | prev [-] | |
No, you're not wrong. Many people will see what you see. Enthusiasts will see squeezing out that last drop of performance as monumental. I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid. Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving. | ||