mgrunwald_ 44 minutes ago

As an example, 2026 GPT doesn't even agree with its 2025 self. Last year I asked it to make a hardware comparison and it correctly identified the objectively better option. Recently I asked again, and this time it got everything completely backwards.

aspenmartin 41 minutes ago | parent [-]

Models are stochastic. Did you look at pass@k? I wouldn’t be surprised if you saw a regression, because these models are extremely complex and the downstream impact of the various decisions that go into them is hard to predict.
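For anyone who hasn't run into pass@k: you sample the model n times, count the c correct answers, and estimate the chance that at least one of k samples is right. A minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k); the function name `pass_at_k` is just illustrative and not from any particular eval library:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct),
    given c correct answers observed among n total samples."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 runs, 5 of them correct
    print(pass_at_k(10, 5, 1))
    print(pass_at_k(10, 5, 3))
```

Note that if all n runs agree, c is 0 or n and the estimate is exactly 0 or 1 regardless of k, so a single run tells you very little either way.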

mgrunwald_ 7 minutes ago | parent [-]

I ran this multiple times through GPT-4 and every single time it arrived at the same conclusion. The data was readily available and pretty clear. GPT-5 insisted that the objectively inferior option was better until I gave it my own benchmark data and it was like "Oh okay nevermind".

Gemini's answer was very opinionated and factually correct, whereas Claude gave a more nuanced answer, which was also very good.

aspenmartin 3 minutes ago | parent [-]

This sounds perfectly reasonable and consistent with our current understanding of these models.