zamadatix 11 hours ago
A new benchmark comes out, it's designed so nothing does well at it, the models max it out, and the cycle repeats. This could describe either massive growth in LLM coding ability or a disconnect between what the new benchmarks measure and why new models eventually score well on them. Under the former assumption there is no limit to the growth of scores... yet there is not much actual growth in capability (if any at all). Under the latter the scores and reality match, but my day-to-day use of the tools does not suggest they've actually gotten >10x better at writing code for me in the last year.

Whether an individual human could do well across all tasks in a benchmark is probably not the right question to ask of a benchmark. It's easy to construct benchmark tasks a human can't do well on, and you don't even need AI to beat the human at them.
falcor84 10 hours ago | parent
Your mileage may vary, but for me, working today with the latest version of Claude Code on a non-trivial Python web dev project, I absolutely feel that I can hand the AI coding tasks that are 10 times more complex or time-consuming than what I could hand to Copilot or Windsurf a year ago. It's still nowhere close to replacing me, but I feel that I can work at a significantly higher level.

What field are you in where you feel there might not have been any growth in capabilities at all?

EDIT: Typo