You have two options for evaluating AI performance: benchmarks or vibes.
Benchmarks are a really good option to have.