A friend is starting a company to do evals by just pitting models against each other in simulations. Their teaser video is good (and humorous!)
https://kradle.ai/