data-ottawa 6 hours ago

Gemini 3 (Flash & Pro) seemingly will _always_ try to answer your question with whatever you give it, which I'm assuming is what drives the mentioned ethics violations/"unhinged" behaviour.

Gemini's strength is definitely that it can use that whole large context window, and it's the first Gemini model to write acceptable SQL. But I agree completely that it's awful at decisions.

I've been building a data-agent tool (similar to [1][2]). Gemini 3's main failure modes are making up metrics that aren't appropriate, and forcing unsuitable data into a conclusion. When a task is clear and possible, it's amazing. When a task is hard with multiple failure paths, Gemini powers through to get an answer anyway.
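To make the made-up-metrics problem concrete, this is roughly the kind of guardrail I mean: check whatever the agent cites against a known metric catalog and flag anything it invented. All the names here (ALLOWED_METRICS, check_metrics) are made up for illustration, not from my tool:

```python
import re

# Illustrative catalog of metrics we actually compute; anything else is suspect.
ALLOWED_METRICS = {"daily_active_users", "revenue_usd", "churn_rate"}

def check_metrics(agent_answer: str) -> list[str]:
    """Return any snake_case identifiers in the answer that aren't in the catalog."""
    candidates = set(re.findall(r"\b[a-z]+(?:_[a-z]+)+\b", agent_answer))
    return sorted(candidates - ALLOWED_METRICS)

answer = "Churn improved because weekly_vibe_score rose 12% while churn_rate fell."
unknown = check_metrics(answer)
if unknown:
    print(f"Answer cites unrecognized metrics: {unknown}")  # ['weekly_vibe_score']
```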

Temperature seems to play a huge role in Gemini's decision quality in my evals, so you can probably tune it for better answers, but I don't have the recipe yet.
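If you want to experiment, this is roughly where I'd start with the google-genai Python SDK; the model id and temperature value below are placeholders to sweep over in your own evals, not a recipe:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",  # substitute whichever Gemini model you're evaluating
    contents="Which table should I join to compute monthly churn?",
    config=types.GenerateContentConfig(temperature=0.2),  # lower = more conservative decisions
)
print(response.text)
```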

The Claude 4+ family (Opus & Sonnet) has been much more honest, but the short context windows really hurt on these analytical use cases, and it can over-focus on minutiae and needs to be course-corrected. ChatGPT looks okay, but I haven't tested it; I've been pretty frustrated by ChatGPT models acting one way in the dev console and completely differently in production.

[1] https://openai.com/index/inside-our-in-house-data-agent/

[2] https://docs.cloud.google.com/bigquery/docs/conversational-a...