| ▲ | cautiouscat 2 hours ago | |
In the automotive world we have benchmarks in HP/torque with the dyno. That’s expensive though, so many depend on their “butt dyno” to judge if their fresh new parts and tune made a difference. I’m curious how this will feel to my code “butt dyno”. I haven’t noticed much of a difference between Opus and Sonnet. I’m comparing this difference to the early days of Claude in 2025. It does what I need and both need a little bit of correction and whatnot. Benchmarks are nice, but I want to see how this feels. Looking forward to trying it later tonight. | ||
| ▲ | sunir 2 hours ago | |
I have a similar question. I think most software projects have reached the point where capturing real information about what the winner's circle looks like, and therefore what the program should be, is many orders of magnitude slower than generating code in the wrong direction. I'd need to measure these new models on well-understood but complex problems that are relatively easy to validate to get a sense of whether they are 'better'; on the other hand, the real impact in daily life may be marginal, since generating code is not the biggest problem at the moment. | ||