"The models perform differently when called via the API vs in the Gemini UI."
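This shouldn't be surprising: the model != the product. The product typically wraps the model in its own system prompt, tools, and settings. In the same way, GPT-4o behaves differently when called via the API than it does inside the ChatGPT product.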