Yes, often you see huge gains in some benchmark, then the model is ran through Aider's polyglot benchmark and doesn't even hit 60%.