| ▲ | PUSH_AX 16 hours ago | |
They set themselves up for flak when they use whatever these evals are… they did the same for Composer 2, which was evaluated as being in close competition with frontier models. Spoiler alert: it wasn’t even close in practice. So now 2.5 is supposed to compete with Opus 4.7? Sure… | ||
| ▲ | tuo-lei 15 hours ago | |
they say it themselves in the post - behavior dimensions "not well captured by existing benchmarks". that was the exact problem with composer 2. not dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, when to re-read a file vs assume. you don't catch any of that in an isolated eval. | ||
| ▲ | infecto 11 hours ago | |
As I have said in prior Composer threads, the proof is in the usage. I am inclined to somewhat believe the results, since I use Composer, and I also read the results in their given context. It’s not a general-purpose SOTA model. It’s a model that runs inexpensively in their coding workflow and produces results similar to Opus or GPT. | ||
| ▲ | criemen 15 hours ago | |
Well, is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P | ||