godelski 5 days ago

Sorry, I've just been hearing this response for years now... Is GPT-5 not SOTA enough for you all now? I remember when people told me to just use 3.5.

  - Gemini 2.5 Pro[0], the top model on LLM Arena. This SOTA enough for you? It even hallucinated Python code!

  - Claude Opus 4.1; sharing the chat would reveal my name, so here's a screenshot[1]. I'll leave that one for you to check.

  - Grok4 getting the right answer but using bad logic[2]

  - Kimi K2[3]

  - Mistral[4]
I'm sorry, but you can fuck off with your goalpost-moving. They all do it. Check yourself.

  > I am being serious
Don't lie to yourself; you never were.

People like you have been using that copy-paste, piss-poor logic since the GPT-3 days. The exact same error has existed on all those models since then, just as it does today. You all were highly disingenuous then, and you still are now. I know this comment isn't going to change your mind, because you never cared about the evidence. You could have checked yourself! So you and your paperclip cult can just fuck off.

[0] https://g.co/gemini/share/259b33fb64cc

[1] https://0x0.st/KXWf.png

[2] https://grok.com/s/c2hhcmQtNA%3D%3D_e15bb008-d252-4b4d-8233-...

[3] http://0x0.st/KXWv.png

[4] https://chat.mistral.ai/chat/8e94be15-61f4-4f74-be26-3a4289d...

FergusArgyll 5 days ago | parent

That's very weird. Before I wrote my comment I asked GPT-5 Thinking (yes, once) and it nailed it. I just assumed the rest would get it as well. Gemini 2.5 is shocking (the code!). I hereby give you leave to be a curmudgeon for another year...

godelski 5 days ago | parent

Try a few times and it'll happen. I don't think it took me more than 3 tries on any platform.

To convince me it is "reasoning", it needs to get the answer right consistently. Most of my attempts were actually spent getting it to show its results. But pay close attention: GPT got the answer right several times, but through incorrect calculations. Go check the "thinking" and see if it does an 11-9=2 calculation somewhere; I saw this in >50% of my attempts. You should be able to reproduce my results in <5 minutes.
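
To make that error concrete, here's a minimal sketch of the faulty step in plain Python. I'm assuming the prompt was the familiar 9.9-vs-9.11 comparison from [5]; the exact wording doesn't matter:

  # Correct arithmetic: compare the actual values.
  a, b = 9.9, 9.11
  print(a > b)            # True: 9.9 is larger
  print(round(a - b, 2))  # 0.79

  # The faulty shortcut you'll find in the "thinking": read the digits
  # after the decimal point as integers, so 11 > 9 and 11 - 9 = 2,
  # and conclude the difference is "0.02".
  frac_a, frac_b = 9, 11  # fractional digits, wrongly read as ints
  print(frac_b - frac_a)  # 2  <- the bogus 11-9=2 step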

Forgive my annoyance, but we've been hearing the same argument you've made for years[0,1,2,3,4]. We're talking about models that have been reported as operating at "PhD level" since the previous generation. People have constantly said "but I get the right answer" or "if you use X model it'll get it right" while missing the entire point. It never mattered whether a model got the answer right once; what matters is whether it can do it consistently. And how it gets the answer matters if you want to claim reasoning. There is still no evidence that LLMs can perform even simple math consistently, despite years of such claims[5].
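
If you want to check consistency yourself, a loop like the one below is all it takes. This is a minimal sketch using the OpenAI Python client; the model name, prompt, and pass check are placeholders for whatever you're testing:

  # Minimal repeated-trial sketch. Assumes the `openai` package and an
  # OPENAI_API_KEY in the environment; prompt/check are placeholders.
  from openai import OpenAI

  client = OpenAI()
  PROMPT = "Which is bigger, 9.9 or 9.11? Answer with just the number."
  N = 20

  correct = 0
  for _ in range(N):
      resp = client.chat.completions.create(
          model="gpt-4o",  # placeholder: swap in the model under test
          messages=[{"role": "user", "content": PROMPT}],
      )
      answer = resp.choices[0].message.content or ""
      correct += "9.9" in answer and "9.11" not in answer

  # One right answer proves nothing; the rate over N trials is the point.
  print(f"{correct}/{N} correct")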

[0] https://news.ycombinator.com/item?id=34113657

[1] https://news.ycombinator.com/item?id=36288834

[2] https://news.ycombinator.com/item?id=36089362

[3] https://news.ycombinator.com/item?id=37825219

[4] https://news.ycombinator.com/item?id=37825059

[5] Don't let your eyes trick you: not all those green squares are 100%... You'll also see plenty of "look, X model got it right!" replies to something that was tested multiple times... https://x.com/yuntiandeng/status/1889704768135905332