| ▲ | __MatrixMan__ 7 hours ago |
| It seems like it's approaching a horizontal asymptote to me, or is at the very least concave down. You might be describing a state 50 years from now. |
|
| ▲ | aurareturn 5 hours ago | parent | next [-] |
| It seems like progress is accelerating, not slowing down. ARC AGI 2: https://x.com/poetiq_ai/status/2003546910427361402 METR: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com... |
| |
| ▲ | __MatrixMan__ 4 hours ago | parent [-] | | Better benchmarks are undeniably progress, but the bottleneck isn't the models anymore; it's the context engineering necessary to harness them. The more time and effort we put into our benchmarking systems, the better we're able to differentiate between models. But when you take an allegedly smart one and try to do something real with it, it behaves like a dumb one again, because you haven't put as much work into the harness for the actual task as you did into the benchmark suite. The knowledge necessary to do real work with these things is still mostly locked up in the humans who have traditionally done that work. |
|
|
| ▲ | anthonypasq 7 hours ago | parent | prev [-] |
| Sonnet 3.7 (the first model truly capable of any sort of reasonable agentic coding at all) was released 10 months ago, and Opus 4.5 exists today! |
| |
| ▲ | rabf 6 hours ago | parent [-] | | To add to this: the tooling, or "harness", around the models has vastly improved as well. You can get far better results with older or smaller models today than you could 10 months ago. |
|