Same with LLM benchmarks these days.
Well, the pelican benchmark is easily verifiable.
Kind of hard to judge, though. How good a pelican looks isn’t really objective.
Or a bicycle!