benterix 5 hours ago
> This is a reminder that benchmarks are meaningless; you should always curate your own out-of-sample benchmarks.

Yeah, I have my own set of tests, and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, scores change even when the model officially hasn't. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it does now.
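A private harness for this doesn't need to be fancy. A minimal sketch in Python (the test cases are toy examples, and `query_model` is a placeholder for whatever API client you actually use):

    """Run a fixed, private out-of-sample test set against pinned model
    IDs and track pass rates over time, so a silent server-side model
    change shows up as score drift on the same suite."""
    from typing import Callable

    # Private cases the models have (hopefully) never seen: (prompt, checker).
    CASES = [
        ("Reverse the string 'benchmark'.", lambda out: "kramhcneb" in out),
        ("Compute 17 * 23.", lambda out: "391" in out),
    ]

    def run_suite(model_id: str,
                  query_model: Callable[[str, str], str]) -> float:
        """Return the fraction of CASES that model_id passes."""
        passed = 0
        for prompt, check in CASES:
            try:
                if check(query_model(model_id, prompt)):
                    passed += 1
            except Exception:
                pass  # treat API errors as failures
        return passed / len(CASES)

    # Example: compare a pinned older snapshot against the current alias.
    # (Model IDs and `call_api` are hypothetical placeholders.)
    # for m in ("gemini-2.5-pro-001", "gemini-2.5-pro"):
    #     print(m, run_suite(m, call_api))

Re-running the same suite against both a pinned snapshot and the floating alias is exactly how the kind of drift described above becomes visible.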
adastra22 5 hours ago
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
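For anyone doing the same, the locking itself is just hard-coding dated snapshot IDs instead of floating aliases. A minimal sketch with the Anthropic Python SDK (the snapshot IDs here are from memory; verify them against the current model list):

    # Pin scripts to exact dated snapshots instead of floating aliases,
    # so an upstream model swap can't silently change behavior.
    # Requires: pip install anthropic (and ANTHROPIC_API_KEY set).
    import anthropic

    # Dated snapshot IDs; check these against Anthropic's model docs.
    PINNED_MODELS = {
        "sonnet": "claude-sonnet-4-20250514",  # Sonnet 4, not 4.5
        "opus": "claude-opus-4-1-20250805",    # Opus 4.1
    }

    client = anthropic.Anthropic()

    def ask(tier: str, prompt: str) -> str:
        """Send prompt to the pinned model for this tier, return the text."""
        msg = client.messages.create(
            model=PINNED_MODELS[tier],
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

    print(ask("sonnet", "Summarize this function in one sentence: def f(x): return x * x"))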
Iulioh 5 hours ago
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models... GPT-4/o3 might be the best we will ever have.