> but without assessing whether the models are actually improving in practical use-cases
Which cases? Not trying to sound rude, but you didn't even give examples of the cases you are using Claude/Codex/Gemini for.