The question isn't just whether a model is good at things in general, but whether it's good at the things I care about.
A targeted eval suite, much like a test suite for code, answers that.
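
To make the test-suite analogy concrete, here is a minimal sketch of what a targeted eval suite could look like. Everything in it is an assumption for illustration: `EvalCase`, `run_suite`, `toy_model`, and the example prompts are hypothetical names and data, and the stand-in model is just a lookup table rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One targeted check: a prompt plus a pass/fail test on the output."""
    category: str                  # the capability I care about
    prompt: str
    check: Callable[[str], bool]   # returns True if the output is acceptable


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case and return the pass rate per category."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        total[case.category] = total.get(case.category, 0) + 1
        if case.check(model(case.prompt)):
            passed[case.category] = passed.get(case.category, 0) + 1
    return {cat: passed.get(cat, 0) / n for cat, n in total.items()}


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


# Hypothetical cases, grouped by the things I care about.
CASES = [
    EvalCase("arithmetic", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("arithmetic", "What is 3 * 5?", lambda out: out.strip() == "15"),
    EvalCase("geography", "Capital of France?", lambda out: "Paris" in out),
]

scores = run_suite(toy_model, CASES)
for category, rate in scores.items():
    print(f"{category}: {rate:.0%} passed")
```

Reporting a pass rate per category, rather than one overall number, is the point of the sketch: it shows where the model is good at the things that matter to me and where it is not, the same way a failing test points at a specific behavior.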