windexh8er 2 days ago

Is it, though? In a way: yes. But look at where the focus of LLMs has gone: agentic frameworks. Yet we see all of the models continually being compared against benchmarks that can easily be gamed by the model itself [0].

There's no great way to gauge the quality / efficacy of something non-deterministic that you can't trust, at least not currently. And I wouldn't be surprised if the providers have known for a while now that their LLMs could be cheating.

On one hand they're saying these models would be apocalyptic if everyone had them, and on the other hand they're showcasing how their models are wiping the floor on benchmarks. So which is it? Personally I don't believe any of these companies at this point, especially when they make claims that are non-public and wrapped in NDAs that benefit their bottom line.

[0] https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/