gpm 2 hours ago
Benchmarks' shortcomings are no worse: they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they are changing the prompt to do better at that and consequently worse at the benchmark.
billylo 2 hours ago | parent
I wonder how best we can measure the usefulness of models going forward. Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? ("Don't do this...", "this is the wrong path...", etc.)
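A minimal sketch of what those three signals might look like over interaction logs. Everything here is an assumption for illustration: the `Interaction` fields, the frustration-phrase list, and the idea of bucketing by week are not from any real product's telemetry.

```python
# Hypothetical sketch: the field names, the phrase list, and weekly
# bucketing are all illustrative assumptions, not a real pipeline.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    week: int           # weeks since launch (assumed bucketing)
    thumbs: int | None  # +1, -1, or None if the user gave no rating
    reply_text: str

# Crude tone markers, echoing the comment's examples.
FRUSTRATION_MARKERS = ("don't do this", "wrong path")

def thumbs_up_rate_by_week(logs: list[Interaction]) -> dict[int, float]:
    """Fraction of rated interactions per week that were thumbs-up (a trend signal)."""
    up: dict[int, int] = defaultdict(int)
    rated: dict[int, int] = defaultdict(int)
    for x in logs:
        if x.thumbs is not None:
            rated[x.week] += 1
            up[x.week] += x.thumbs == 1
    return {w: up[w] / rated[w] for w in rated}

def usage_growth(logs: list[Interaction]) -> dict[str, float]:
    """Per-user ratio of last-week to first-week interaction counts (retention proxy)."""
    counts: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for x in logs:
        counts[x.user_id][x.week] += 1
    growth = {}
    for user, weeks in counts.items():
        first, last = min(weeks), max(weeks)
        if first != last:
            growth[user] = weeks[last] / weeks[first]
    return growth

def frustration_rate(logs: list[Interaction]) -> float:
    """Share of user replies containing a frustration phrase (a crude tone signal)."""
    hits = sum(any(m in x.reply_text.lower() for m in FRUSTRATION_MARKERS) for x in logs)
    return hits / len(logs) if logs else 0.0
```

Each of these has the same weakness gpm points out about benchmarks: thumbs ratings are given by a self-selected minority, usage growth conflates habit with usefulness, and keyword-based tone detection measures phrasing, not satisfaction.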