There should be a way to turn the questions we ask LLMs into benchmarks.
That way, we would have a benchmark that stays up to date, because it keeps growing with the questions people actually ask.