aspenmartin 2 hours ago
Again, correct, but it overstates the issue. I can say labs don’t want this. It happened, arguably unintentionally, in Meta’s Llama 4 release: it went horribly, heads rolled, something like several billion dollars was paid for new talent, and the org that built Llama 4 was destroyed. Evals come from a million places, and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. Individually, all of them are flawed. Taken together, the aggregate signal is highly useful, because you more or less marginalize over a lot of different things. Not to mention that these companies have plenty of proprietary internal measurements: they build benchmarks themselves to probe their models, and they also have flywheel traffic and A/B tests. You are right to call out benchmarks, but to dismiss them or not take them seriously is a mistake.
taormina an hour ago
Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. Doing the most basic of things at least as well as it used to is table stakes. And if you can’t do the basic stuff, how on earth can you be trusted with more?