morsecodist | 21 hours ago |
I think it actually makes sense to trust your vibes more than benchmarks. The act of creating a benchmark is the hard part. If we had a perfect benchmark, AI problems would be trivially solvable. Benchmarks are meaningless on their own; they are supposed to be a proxy for actual usefulness. I'm not sure what better question there is than: can it do what I want? And for me, the ratio of yes to no on that hasn't changed much.
hodgehog11 | 4 minutes ago | parent |
I agree that this is a sensible judgement for practical use, but my point is that the vibes will likely change; it's just a matter of when. You can't draw a trendline on a nonlinear metric, especially when you have no knowledge of the inflection point. Individual benchmarks are certainly fallible, and we always need better ones, but the aggregate of all the benchmarks together (along with other theoretical metrics not based on test data) correlates reasonably well with opinion polling, and these are all improving at a consistent rate. It's just unclear when these model improvements will lead to the outcomes you're looking for. When it happens, it will appear like a massive leap in performance, but really it's just a threshold being hit.
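To make the threshold point concrete, here's a minimal toy sketch (all numbers hypothetical, not from any real benchmark): a latent capability score improves at a steady linear rate, but each task only gets solved once capability clears that task's difficulty. The aggregate pass rate stays near zero, then jumps sharply, even though the underlying improvement never accelerated.

```python
import random

random.seed(0)

# Hypothetical task difficulties, clustered around 0.7 on an arbitrary 0-1 scale.
difficulties = [random.gauss(0.7, 0.05) for _ in range(1000)]

def pass_rate(capability, noise=0.02):
    """Fraction of tasks solved: success when noisy capability clears the task's difficulty."""
    solved = sum(1 for d in difficulties if capability + random.gauss(0, noise) > d)
    return solved / len(difficulties)

# Capability improves by the same fixed step each "model generation",
# yet the observed pass rate looks flat, then leaps, then saturates.
for gen, capability in enumerate(c / 100 for c in range(50, 91, 5)):
    print(f"gen {gen}: capability={capability:.2f}  pass rate={pass_rate(capability):.1%}")
```

The printout goes from roughly 0% through the early generations to near 100% within a couple of steps around the difficulty cluster, which is the "massive leap" effect: a smooth trend in the underlying metric, a discontinuous-looking trend in the vibes.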