The only metric that worked for me is running the same prompt 5x for each LLMs on my projects.
I keep specific branches a state where they are ready to develop new features.