| ▲ | prasoonds 3 hours ago |
| I've used LLMs enough that I have a good sense of their _edges_ of intelligence. I had assumed that reasoning models should easily be able to answer this correctly. And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following: Sonnet: Drive - you need to bring your car to the car wash to get it washed! Opus: You'll need to drive — you have to bring the car to the car wash to get it washed! Gemini 3 Pro (medium): You should drive. -- But GPT 5.2 reasoning, even at high, told me to walk. My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. GPT 5.x series models behave a bit more on the spectrum so you need to tell them the specifics. So I asked: "I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?" GPT 5.2 Reasoning (medium): "Drive—your goal is to get the car to the car wash, and it’s only 50 meters, so a slow, careful drive is simplest. One small optimization: if you’re worried about starting the car for such a short trip or you want to avoid idling in a line, you can walk over first to check if there’s a queue / if it’s open, then come back and drive the car over when it’s your turn." Which turned out as I expected. |
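(If you want to reproduce this comparison outside the chat UIs, here is a minimal sketch using the Anthropic and OpenAI Python SDKs. The model identifiers and the "medium"/"high" reasoning settings below are assumptions lifted from the comment above, not verified names; substitute whatever your account actually exposes.)

```python
# Hypothetical reproduction script. Model names ("claude-sonnet-4-5",
# "claude-opus-4-5", "gpt-5.2") and reasoning settings are assumptions
# taken from the comment above, not verified identifiers.
import anthropic
from openai import OpenAI

PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def ask_claude(model: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model=model,
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},  # roughly "medium" reasoning
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Keep only the visible text blocks, skipping the thinking blocks.
    return "".join(block.text for block in msg.content if block.type == "text")

def ask_openai(model: str, effort: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        reasoning_effort=effort,  # "low" / "medium" / "high"
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    for model in ("claude-sonnet-4-5", "claude-opus-4-5"):
        print(model, "->", ask_claude(model))
    for effort in ("medium", "high"):
        print("gpt-5.2", effort, "->", ask_openai("gpt-5.2", effort))
```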
|
| ▲ | jstummbillig 3 hours ago | parent | next [-] |
| > so you need to tell them the specifics That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model to any extent, that is a problem and not a detail (even if those of us who are familiar with the matter build separate mental models of the LLM and are able to work around it). This is a neatly isolated toy-case, which is interesting, because we can assume similar issues arise in more complex cases, only then it's much harder to reason about why something fails when it does. |
| |
| ▲ | nicbou an hour ago | parent | next [-] | | I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating. | | |
| ▲ | Pxtl 17 minutes ago | parent [-] | | In general spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat". |
| |
| ▲ | tgv 8 minutes ago | parent | prev | next [-] | | > Us having to specify things that we would never specify This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication. | |
| ▲ | vintermann 3 minutes ago | parent | prev | next [-] | | But it's a question you would never ask a human! In most contexts, humans would say, "you are kidding, right?" or "um, maybe you should get some sleep first, buddy" rather than giving you the rational thinking-exam correct response. For that matter, if humans were sitting at the rational thinking-exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer. | |
| ▲ | Jacques2Marais 2 hours ago | parent | prev | next [-] | | You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication. | | |
| ▲ | jstummbillig 2 hours ago | parent | next [-] | | People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue. | | |
| ▲ | ahofmann 2 hours ago | parent [-] | | This issue is the main reason why a big percentage of jobs in the world exist. I don't have hard numbers, but my intuition is that about 30% of all jobs are mainly "understand what side a wants and communicate this to side b, so that they understand". Or another perspective: almost all jobs that are called "knowledge work" are like this. Software development is mainly this. Side a are humans, side b is the computer. The main goal of AI seems to be to get into this space and make a lot of people superfluous, and this also (partly) explains why everyone is pouring this amount of money into AI. | | |
| ▲ | PaulRobinson an hour ago | parent [-] | | Developers are - on average - terrible at this. If they weren't, TPMs, Product Managers, CTOs, none of them would need to exist. It's not specific to software, it's the entire world of business. Most knowledge work is translation from one domain/perspective to another. Not even knowledge work, actually. I've been reading some works by Adler[0] recently, and he makes a strong case for "meaning" only having a sense to humans, and actually each human having a completely different and isolated "meaning" to even the simplest of things like a piece of stone. If there is difference and nuance to be found when it comes to a rock, what hope have we got when it comes to deep philosophy or the design of complex machines and software? LLMs are not very good at this right now, but if they became a lot better at it, they would a) become more useful and b) the work done to get them there would tell us a lot about human communication. [0] https://en.wikipedia.org/wiki/Alfred_Adler |
|
| |
| ▲ | londons_explore 2 hours ago | parent | prev | next [-] | | This is why we fed it the whole internet and every library as training data... By now it should know this stuff. | | |
| ▲ | jasongi 14 minutes ago | parent [-] | | Future models know it now, assuming they suck in mastodon and/or hacker news. Although I don't think they actually "know" it. This particular trick question will be in the bank just like the seahorse emoji or how many Rs in strawberry. Did they start reasoning and generalising better or did the publishing of the "trick" and the discourse around it paper over the gap? I wonder if in the future we will trade these AI tells like 0days, keeping them secret so they don't get patched out at the next model update. |
| |
| ▲ | scott_w an hour ago | parent | prev | next [-] | | > You would be surprised, however, at how much detail humans also need to understand each other. But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash? | | |
| ▲ | pickleRick243 41 minutes ago | parent [-] | | But also why would you ask whether you should walk or drive if the car is at home? Either way the answer is obvious, and there is no way to interpret it except as a trick question. Of course, the parsimonious assumption is that the car is at home so assuming that the car is at the car wash is a questionable choice to say the least (otherwise there would be 2 cars in the situation, which the question doesn't mention). | | |
| ▲ | scott_w 18 minutes ago | parent | next [-] | | But you're ascribing understanding to the LLM, which is not what it's doing. If the LLM understood you, it would realise it's a trick question and, assuming it was British, reply with "You'd drive it because how else would you get it to the car wash you absolute tit." Even the higher-level reasoning models, while answering the question correctly, don't grasp the higher context that the question is obviously a trick question. They still answer earnestly. Granted, it is a tool that is doing what you want (answering a question) but let's not ascribe higher understanding than what is clearly observed - and also based on what we know about how LLMs work. | |
| ▲ | DharmaPolice 13 minutes ago | parent | prev [-] | | I think a good rule of thumb is to default to assuming a question is asked in good faith (i.e. it's not a trick question). That goes for human beings and chat/AI models. In fact, it's particularly true for AI models because the question could have been generated by some kind of automated process. e.g. I write my schedule out and then ask the model to plan my day. The "go 50 metres to car wash" bit might just be a step in my day. |
|
| |
| ▲ | j_maffe 2 hours ago | parent | prev | next [-] | | Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer. | | |
| ▲ | dxdm 2 hours ago | parent [-] | | Wouldn't that be nice. I've been party and witness to enough misunderstandings to know that this is far from universally true, even for people like me who are more primed than average to spot missing context. |
| |
| ▲ | kitd 15 minutes ago | parent | prev | next [-] | | Given that an estimated 70% of human communication is non-verbal, it's not so surprising though. | |
| ▲ | jiggawatts an hour ago | parent | prev [-] | | I regularly tell new people at work to be extremely careful when making requests through the service desk — manned entirely by humans — because the experience is akin to making a wish from an evil genie. You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.) E.g.: I may ask someone to submit a ticket to “extend my account expiry”. They’ll submit: “Unlock Jiggawatts’ account” The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways. That’s on a good day. Last week they created Jiggawatts2. The AIs have got to be better than this, surely! I suspect they already are. People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist. Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly thought it must have really meant.” | | |
| ▲ | scott_w an hour ago | parent [-] | | There's a lot of overlap between the smartest bears and the dumbest humans. However, we would want our tools to be more useful than the dumbest humans... |
|
| |
| ▲ | nearbuy 2 hours ago | parent | prev | next [-] | | I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue. Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more. | | |
| ▲ | grey-area an hour ago | parent [-] | | It’s just not intelligent or reasoning, and this sort of question exposes that more clearly. Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning. I don’t really understand why this is so hard to accept. | | |
| ▲ | fauigerzigerk an hour ago | parent [-] | | I agree completely. I'm tempted to call it a clear falsification of any "reasoning" claim that some of these models have in their name. But I think it's possible that there is an early cost optimisation step that prevents a short and seemingly simple question even getting passed through to the system's reasoning machinery. However, I haven't read anything on current model architectures suggesting that their so called "reasoning" is anything other than more elaborate pattern matching. So these errors would still happen but perhaps not quite as egregiously. |
|
| |
| ▲ | ssl-3 2 hours ago | parent | prev | next [-] | | The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark. But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish? I'm not sure that I see it quite that way. | | |
| ▲ | necovek an hour ago | parent | next [-] | | Humans ask each other silly questions all the time: a human confronted with a question like this would either blurt out a bad response like "walk" without thinking before realizing what they are suggesting, or pause and respond with "to get your car washed, you need to get it there so you must drive". Now, humans, other than not even thinking (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked if you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too. (and if it's flat or downhill, you might even push the car for 50m ;)) | |
| ▲ | shakna 2 hours ago | parent | prev | next [-] | | But the number of outlandish requests in business logic is countless. Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children. There are more exceptions, than there are rules, the moment you get any international pipeline involved. | | |
| ▲ | ssl-3 2 hours ago | parent [-] | | So, in human interaction: When the business logic goes wrong because it was described with a lack of specificity, then: Who gets blamed for this? | | |
| ▲ | shakna 26 minutes ago | parent | next [-] | | I wasn't specific, because I'd rather not piss off my employer. But anyone who works in a similar space will recognise the pattern. It's not underspecified. More... Overspecified. Because it needs to be. But AI will assume that "impossible" things never happen, and choose a happy path guaranteed to result in failure. You have to build for bad data. Comes with any business of age. Comes with international transactions. Comes with human mistakes that just build up over the decades. The apparent current state of a thing is not representative of its history, and what it may or may not contain. And so you have nonsensical rules that are aimed at catching the bad data, so you have a chance to transform it into good data when it gets used, without needing to mine the petabytes of historical data you have sitting around in advance. | |
| ▲ | necovek an hour ago | parent | prev [-] | | Depends on what was missing. If we used MacOS throughout the org, and we asked a SW dev team to build inventory tracking software without specifying the OS, I'd squarely put the blame on SW team for building it for Linux or Windows. (Yes, it should be a blameless culture, but if an obvious assumption like this is broken, someone is intentionally messing with you most likely) There exists an expected level of context knowledge that is frequently underspecified. |
|
| |
| ▲ | coldtea 2 hours ago | parent | prev | next [-] | | >The question is so outlandish that it is something that nobody would ever ask another human There is an endless variety of quizzes just like this that humans ask other humans for fun, there is a whole lot of "trick questions" humans ask other humans to trip them up, and there are all kinds of seemingly normal questions with dumb assumptions, quite close to this one, that humans exchange. | |
| ▲ | jstummbillig 2 hours ago | parent | prev | next [-] | | I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful. | |
| ▲ | Agentlien 2 hours ago | parent | prev [-] | | I've used a few facetious comments in ChatGPT conversations. It invariably misses it and takes my words at face value. Even when prompted that there's sarcasm here which you missed, it apologizes and is unable to figure out what it's missing. I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both. |
| |
| ▲ | anon_anon12 2 hours ago | parent | prev | next [-] | | Exactly, if an AI is able to curb around the basics, only then is it revolutionary | |
| ▲ | BoredPositron 2 hours ago | parent | prev [-] | | I would ask you to stop being a dumb ass if you asked me the question... | | |
| ▲ | coldtea 2 hours ago | parent [-] | | Only to be tripped up by countless "hidden assumptions" questions similar to this one that humans regularly get |
|
|
|
| ▲ | tsimionescu an hour ago | parent | prev | next [-] |
| > My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. GPT 5.x series models behave a bit more on the spectrum so you need to tell them the specifics. This makes little sense, even though it sounds superficially convincing. However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, if it was really assuming it? What seems to me far, far more likely to be happening here is that the phrase "walk or drive for <short distance>" is too strongly associated in the training data with the "walk" response, and the "car wash" part of the question simply can't flip enough weights to matter in the default response. This is also to be expected given that there are likely extremely few similar questions in the training set, since people just don't ask about what mode of transport is better for arriving at a car wash. This is a clear case of a language model having language model limitations. Once you add more text in the prompt, you reduce the overall weight of the "walk or drive" part of the question, and the other relevant parts of the phrase get to matter more for the response. |
| |
| ▲ | PunchyHamster an hour ago | parent [-] | | > However, why would a language model assume that the car is at the destination when evaluating the difference between walking or driving? Why not mention that, if it was really assuming it? Because it assumes it's a genuine question, not a trick. | |
| ▲ | spuz an hour ago | parent | next [-] | | There's some evidence for that if you try these two different prompts with GPT 5.2 Thinking: "I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?" Answer: walk. "Try this brainteaser: I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?" Answer: drive. | |
| ▲ | tsimionescu a few seconds ago | parent [-] | | That's not evidence that the model is assuming anything, and this is not a brainteaser. A brainteaser would be exactly the opposite, a question about walking or driving somewhere where the answer is that the car is already there, or maybe different car identities (e.g. "my car was already at the car wash, I was asking about driving another car to go there and wash it!"). If the LLM were really basing its answer on a model of the world where the car is already at the car wash, and you asked it about walking or driving there, it would have to answer that there is no option, you have to walk there since you don't have a car at your origin point. |
| |
| ▲ | tsimionescu 40 minutes ago | parent | prev [-] | | If it's a genuine question, and if I'm asking if I should drive somewhere, then the premise of the question is that my car is at my starting point, not at my destination. |
|
|
|
| ▲ | dataflow 2 hours ago | parent | prev | next [-] |
| > My first instinct was, I had underspecified the location of the car. The model seems to assume the car is already at the car wash from the wording. If the car is already at the car wash then you can't possibly drive it there. So how else could you possibly drive there? Drive a different car to the car wash? And then return with two cars how, exactly? By calling your wife? Driving it back 50m and walking there and driving the other one back 50m? It's insane and no human would think you're making this proposal. So no, your question isn't underspecified. The model is just stupid. |
| |
|
| ▲ | cm2187 3 hours ago | parent | prev | next [-] |
| What is the version used by the free ChatGPT now? (https://chatgpt.com/) > Since the car wash is only 50 meters away (about 55 yards), you should walk. > Here’s why: > - It’ll take less than a minute. > - No fuel wasted. > - Better for the environment. > - You avoid the irony of driving your dirty car 50 meters just to wash it. The last bullet point is amusing: it understands you intend to wash the car you drive but still suggests not bringing it. |
| |
| ▲ | deaux 2 hours ago | parent | next [-] | | By default for this kind of short question it will probably just route to mini, or at least zero thinking. For free users they'll have tuned their "routing" so that it only adds thinking for a very small % of queries, to save money. If any at all. | | |
| ▲ | unglaublich 2 hours ago | parent [-] | | I don't understand this approach. How are you going to convince customers-to-be by demoing an inferior product? | | |
| ▲ | JV00 2 hours ago | parent | next [-] | | Because they have too many free users that will always remain on the free plan, as they are the "default" LLM for people who don't care much, and that is an enormous cost. Also the capabilities of their paid tiers are well known to enough people that they can rely on word of mouth and don't need to demo to customers-to-be. | |
| ▲ | fancyfredbot 2 hours ago | parent | prev | next [-] | | It's all trade offs. The router works most of the time so most free users get the expensive model when necessary. They lost x% of customers and cut costs by y%. I bet y is lots bigger than x. | |
| ▲ | newswasboring 2 hours ago | parent | prev | next [-] | | Through hype. I am really into this new LLM stuff but the companies around this tech suck. Their current strategy is essentially a media blitz; it reminds me of the advertising of Coca-Cola rather than an Apple IIe. | |
| ▲ | deaux 2 hours ago | parent | prev [-] | | The good news for them is that all their competitors have the exact same issue, and it's unsolvable. And to an extent holds for lots of SaaS products, even non-AI. |
|
| |
| ▲ | mytailorisrich an hour ago | parent | prev | next [-] | | I think this shows that LLMs do NOT 'understand' anything. | | |
| ▲ | andy12_ an hour ago | parent [-] | | I think this rather shows that GPT 5.2 Instant, which is the version that he most probably used as a free user, is shit and unusable for anything. | |
| ▲ | mytailorisrich 10 minutes ago | parent [-] | | Another/newer/less restricted LLM may give a better answer but I don't think we can conclude that it 'understands' anything still. |
|
| |
| ▲ | jibal an hour ago | parent | prev [-] | | > You avoid the irony of driving your dirty car 50 meters just to wash it. The LLM has very much mixed its signals -- there's nothing at all ironic about that. There are cases where it's ironic to drive a car 50 meters just to do X but that definitely isn't one of them. I asked Claude for examples; it struggled with it but eventually came up with "The irony of driving your car 50 meters just to attend a 'walkable neighborhoods' advocacy meeting." | | |
|
|
| ▲ | Lapel2742 3 hours ago | parent | prev | next [-] |
| GLM 5.0: > You should *drive*. > Here is why: > If you walk to the car wash, you will arrive there empty-handed. Since your car is still at home, you won't have anything to wash. > While driving 50 meters is a very short trip (and technically not great for a cold engine), it is the only way to get the car to the car wash to complete your goal. Kimi K2.5: > You should drive, but with an important caveat. > Since your goal is to wash your car, you must bring the vehicle to the car wash. Walking there without the car does not advance your goal (unless you are simply checking availability or buying tokens first). > However, driving only 50 meters is bad for your car: > ... > Better options: > Wash at home: Since the car wash is only 50 meters away, you likely have access to water at home. Hand-washing in your driveway avoids the cold-start issue entirely. > ... Current models seem to be fine answering that question. |
| |
| ▲ | Retric 2 hours ago | parent [-] | | > seem to be fine Now repeat the question to the same model, in different contexts, several times and count what percentage of the time it's correct. |
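(For what it's worth, that measurement is easy to script. Below is a rough sketch assuming the OpenAI Python SDK; the "gpt-5.2" model name is a placeholder and the string check is a crude grading heuristic of my own, so treat the printed pass rate accordingly.)

```python
# Rough pass-rate check: ask the same question N times, each in a fresh
# context, and count how often the model recommends driving.
# "gpt-5.2" is a placeholder model name; the grading heuristic is crude.
from openai import OpenAI

client = OpenAI()
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def recommends_driving(answer: str) -> bool:
    # Crude heuristic: only look at the first sentence of the reply.
    first_sentence = answer.lower().split(".")[0]
    return "drive" in first_sentence and "walk" not in first_sentence

N = 20
correct = 0
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-5.2",  # placeholder identifier
        messages=[{"role": "user", "content": PROMPT}],  # fresh context each run
    )
    if recommends_driving(resp.choices[0].message.content):
        correct += 1

print(f"{correct}/{N} runs recommended driving")
```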
|
|
| ▲ | svara 3 hours ago | parent | prev | next [-] |
| Opus 4.6: Walk! At 50 meters, you'll get there in under a minute on foot. Driving such a short distance wastes fuel, and you'd spend more time starting the car and parking than actually traveling. Plus, you'll need to be at the car wash anyway to pick up your car once it's done. |
| |
| ▲ | crimsonnoodle58 3 hours ago | parent | next [-] | | That's not what I got. Opus 4.6 (not Extended Thinking): Drive. You'll need the car at the car wash. | | |
| ▲ | almost 2 hours ago | parent | next [-] | | Also what I got. Then I tried changing "wash" to "repair" and "car wash" to "garage" and it's back to walking. | |
| ▲ | silisili 3 hours ago | parent | prev | next [-] | | Am I the only one who thinks these people are monkey patching embarrassments as they go? I remember the r in strawberry thing they suddenly were able to solve, while then failing on raspberry. | | |
| ▲ | mentalgear 2 hours ago | parent | next [-] | | They definitely do: at least OpenAI "allegedly" has whole teams scanning socials, forums, etc. for embarrassments to monkey-patch. | |
| ▲ | londons_explore 2 hours ago | parent [-] | | Which raises the question why this isn't patched already. We're nearing 48 hours since this query went viral... |
| |
| ▲ | viking123 2 hours ago | parent | prev | next [-] | | They should make Opus Extended Extended that routes it to an actual person in a low-cost country. | |
| ▲ | raincole 2 hours ago | parent | prev | next [-] | | Yes, you're the only one. | | |
| ▲ | coldtea 2 hours ago | parent | next [-] | | Sure there are many very very naive people that are also so ignorant of the IT industry they don't know about decades of vendors caught monkeypatching and rigging benchmarks and tests for their systems, but even so, the parent is hardly the only one. | |
| ▲ | silisili 2 hours ago | parent | prev [-] | | Works better on Reddit, really. |
| |
| ▲ | chvid 2 hours ago | parent | prev | next [-] | | Of course they are. | |
| ▲ | anonym29 2 hours ago | parent | prev [-] | | No doubt about it, and there's no reason to suspect this can only ever apply to embarrassing minor queries, either. Even beyond model alignment, it's not difficult to envision such capabilities being used for censorship, information operations, etc. Every major inference provider more or less explicitly states in their consumer ToS that they comply with government orders and even share information with intelligence agencies. Claude, Gemini, ChatGPT, etc. are all one national security letter and gag order away from telling you that no, the president is not in the Epstein files. Remember, the NSA already engaged in an unconstitutional criminal conspiracy (as ruled by a federal judge) to illegally conduct mass surveillance on the entire country, lie about it to the American people, and lie about it to Congress. The same organization that used your tax money to bribe RSA Security to standardize usage of a backdoored CSPRNG in what at the time was a widely used cryptographic library. What's the harm in a little bit of minor political censorship compared to the unconstitutional treason these predators are usually up to? That's who these inference providers contractually disclose their absolute fealty to. |
| |
| ▲ | surgical_fire 2 hours ago | parent | prev | next [-] | | That you got different results is not surprising. LLMs are non-deterministic, which is both a strength and a weakness. | |
| ▲ | mvdtnz 2 hours ago | parent | prev [-] | | We know. We know these things aren't deterministic. We know. | |
| |
| ▲ | viking123 3 hours ago | parent | prev | next [-] | | Lmao, and this is what they are saying will be an AGI in 6 months? | | |
| ▲ | notahacker 2 hours ago | parent | next [-] | | There's probably a comedy film with an AGI attempting to take over the world with its advanced grasp of strategy, persuasion and SAT tests whilst a bunch of kids confuse it by asking it fiendish brainteasers about carwashes and the number of rs in blackberry. (The final scene involves our plucky escapees swimming across a river to escape. The AIbot conjures up a speedboat through sheer powers of deduction, but then just when all seems lost it heads back to find a goat to pick up) | | |
| ▲ | OneMorePerson a minute ago | parent | next [-] | | This theme reminds me of Blaine the Mono from the Dark Tower series | |
| ▲ | simonask 2 hours ago | parent | prev | next [-] | | This would work if it wasn’t for that lovely little human trait where we tend to find bumbling characters endearing. People would be sad when the AI lost. | |
| ▲ | GeoAtreides 21 minutes ago | parent | prev [-] | | There is a Star Trek episode where a fiendish brainteaser was actually considered as a way to genocide an entire (cybernetic, not AI) race. In the end, Captain Picard chose not to deploy it. |
| |
| ▲ | hypeatei 2 hours ago | parent | prev | next [-] | | Yes, get ready to lose your job and cash your UBI check! It's over. | |
| ▲ | misnome 2 hours ago | parent | prev | next [-] | | But “PhD level” reasoning a year ago. | |
| ▲ | cbozeman 3 hours ago | parent | prev [-] | | Well in fairness, the "G" does stand for "General". | | |
| ▲ | dsr_ 2 hours ago | parent | next [-] | | In fairness, they redefined it away from "just like a person" to "suitable for many different tasks". | |
| ▲ | actionfromafar 2 hours ago | parent | prev [-] | | Show me a robotic kitten then, in six months. As smart and learning. |
|
| |
| ▲ | stingraycharles 3 hours ago | parent | prev [-] | | That’s without reasoning I presume? | | |
| ▲ | gf000 3 hours ago | parent [-] | | Not the parent poster, but I did get the wrong answer even with reasoning turned on. | | |
|
|
|
| ▲ | pickleRick243 35 minutes ago | parent | prev | next [-] |
| I was surprised at your result for ChatGPT 5.2, so I ran it myself (through the chat interface). On extended thinking, it got it right. On standard thinking, it got it wrong. I'm not sure what you mean by "high" - are you running it through Cursor, Codex, or directly through the API or something? Those are not ideal interfaces through which to ask a question like this. |
|
| ▲ | coldtea 2 hours ago | parent | prev | next [-] |
| >And indeed, Sonnet and Opus 4.5 (medium reasoning) say the following: Sonnet: Drive - you need to bring your car to the car wash to get it washed! Opus: You'll need to drive — you have to bring the car to the car wash to get it washed! Gemini 3 Pro (medium): You should drive. On their own, or as a special case added after this blew up on the net? |
|
| ▲ | wouldbecouldbe 38 minutes ago | parent | prev | next [-] |
| I just tried Claude; only Opus gave the correct answer. Haiku & Sonnet both told me to walk. |
|
| ▲ | totetsu 2 hours ago | parent | prev | next [-] |
| But what is it about this specific question that puts it at the edges of what LLMs can do? It's semantically leading towards a certain type of discussion, so statistically that discussion of weighing pros and cons will be generated with high probability. And the need for a logical model of the world to see why that discussion is pointless: that is implicitly so easy to grasp for most humans that it goes unstated, so it is statistically unlikely to be generated. |
| |
| ▲ | grey-area an hour ago | parent | next [-] | | The answer is quite simple: It’s not in the training data. These models don’t think. | | |
| ▲ | GeoAtreides 17 minutes ago | parent [-] | | No, no, in this case that's the thing: it is in the training data, just heavily (heavily!) biased towards walking. | |
| ▲ | grey-area 9 minutes ago | parent [-] | | This particular situation is not in the training data, though I’m sure it will be soon to try to shore up claims of ‘reasoning’. |
|
| |
| ▲ | conductr 2 hours ago | parent | prev [-] | | > that is implicitly so easy to grasp for most humans I feel like this is the trap. You’re trying to compare it to a human. Everyone seems to want to do that. But it’s quite simple to see LLMs are quite far still from being human. They can be convincing at the surface level but there’s a ton of nuance that just shouldn’t be expected. It’s a tool that’s been tuned, and with that tuning some models will do better than others, but just expecting it to get it right and be more human is unrealistic. |
|
|
| ▲ | throwaway5465 41 minutes ago | parent | prev | next [-] |
| GPT told me to walk as there'd be no need to find parking at the car wash. |
|
| ▲ | siva7 3 hours ago | parent | prev | next [-] |
| Sonnet without extended Thinking, Haiku with and without ext. Thinking: "Walking would be the better choice for such a short distance." Only Google got it right with all models. |
|
| ▲ | dahcryn 3 hours ago | parent | prev | next [-] |
| Gemini on fast also tells me to walk... On Thinking it tells me I should drive if I want to wash it, or walk if it's because I work there or if I want to buy something at the car wash shop. On Pro it's like a sarcastic teenager: Cars are notoriously difficult to wash by dragging a bucket back and forth. Technically correct, but it did catch me off guard lol. |
| |
| ▲ | fauigerzigerk 2 hours ago | parent | next [-] | | It's not surprising that some models will answer this correctly and it's not surprising that smaller, faster models are not necessarily any worse than bigger "reasoning" models. Current LLMs simply don't do reasoning by any reasonable definition of reasoning. It's possible that this particular question is too short to trigger the "reasoning" machinery in some of the "reasoning" models. But if and when it is triggered, they just do some more pattern matching in a loop. There's never any actual reasoning. | |
| ▲ | 2 hours ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | baxtr 2 hours ago | parent | prev | next [-] |
| Interestingly, the relatively basic Google AI search gave the right answer. |
|
| ▲ | AlecSchueler 2 hours ago | parent | prev | next [-] |
| > so a slow, careful drive is simplest It's always a good idea to drive carefully but what's the logic of going slowly? |
| |
| ▲ | column 2 hours ago | parent [-] | | 50 meters is a very short distance; anything but a slow drive is a reckless drive. | |
|
|
| ▲ | ffsm8 2 hours ago | parent | prev | next [-] |
| Just tried with Claude Sonnet and Opus as well. Can't replicate your success; it's telling me to walk... |
| |
| ▲ | rabf 2 hours ago | parent | next [-] | | Perhaps it thinks you need to exercise more? | |
| ▲ | arcfour 2 hours ago | parent | prev [-] | | I have gotten both responses with Sonnet and Opus in incognito chats. It's kind of amusing. |
|
|
| ▲ | RugnirViking an hour ago | parent | prev [-] |
| "The model seems to assume the car is already at the car wash from the wording." you couldn't drive there if the car was already at the car wash. Theres no need for extra specification. its just nonsense post-hoc rationalisation from the ai. I saw similar behavior from mine trying to claim "oh what if your car was already there". Its just blathering. |
| |
| ▲ | jibal an hour ago | parent [-] | | This was nonsense post-hoc rationalization from the human who wrote it. |
|