It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on so much public data that it would be impractical to guarantee no benchmark data ends up in the training set. And even if it doesn't, it's safe to assume that the engineers building these models test them against all kinds of benchmarks and tweak them accordingly. This happens all the time in other industries as well.
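To make the contamination point concrete, here is a minimal sketch of what a naive decontamination check could look like, loosely in the spirit of the n-gram overlap filters some labs describe in their model reports. The function names, n-gram size, and threshold are all made up for illustration, not any lab's actual pipeline; the takeaway is that even this simple check has to run over every training document against every benchmark item, which gets expensive fast at web scale.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc: str, benchmark_items: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training document that shares a large fraction of its
    n-grams with any benchmark item (illustrative threshold)."""
    doc_ngrams = ngrams(train_doc, n)
    for item in benchmark_items:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue
        overlap = len(doc_ngrams & item_ngrams) / len(item_ngrams)
        if overlap >= threshold:
            return True
    return False


# Toy example: a blog post quoting a benchmark prompt verbatim gets flagged.
benchmark = ["Generate an SVG of a pelican riding a bicycle"]
doc = "a post quoting the prompt: generate an svg of a pelican riding a bicycle"
print(is_contaminated(doc, benchmark))  # True
```

And that is the easy case: paraphrases, translations, and partial quotes slip straight past an exact n-gram match, which is why "we filtered the benchmarks out" is never an airtight guarantee.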
So the pelican-riding-a-bicycle test is interesting, but it's not a reliable performance indicator at this point.