jstummbillig 2 hours ago

> so you need to tell them the specifics

That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model, to any extent, that is a problem and not a detail (regardless of the fact that those of us who are familiar with the matter do build separate mental models of the LLM and are able to work around it).

This is a neatly isolated toy-case, which is interesting, because we can assume similar issues arise in more complex cases, only then it's much harder to reason about why something fails when it does.

tgv 4 minutes ago | parent | next [-]

> Us having to specify things that we would never specify

This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.

nicbou an hour ago | parent | prev | next [-]

I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.

Pxtl 14 minutes ago | parent [-]

In general, spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat".

Jacques2Marais 2 hours ago | parent | prev | next [-]

You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication.

jstummbillig 2 hours ago | parent | next [-]

People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue.

ahofmann 2 hours ago | parent [-]

This issue is the main reason why a big percentage of jobs in the world exist. I don't have hard numbers, but my intuition is that about 30% of all jobs are mainly "understand what side a wants and communicate this to side b, so that they understand". Or another perspective: almost all jobs that are called "knowledge work" are like this. Software development is mainly this. Side a are humans, side b is the computer. The main goal of AI seems to be to get into this space and make a lot of people superfluous, and this also (partly) explains why everyone is pouring this amount of money into AI.

PaulRobinson an hour ago | parent [-]

Developers are - on average - terrible at this. If they weren't, TPMs, Product Managers, CTOs, none of them would need to exist.

It's not specific to software, it's the entire world of business. Most knowledge work is translation from one domain/perspective to another. Not even knowledge work, actually. I've been reading some works by Adler[0] recently, and he makes a strong case for "meaning" only having a sense to humans, and for each human having a completely different and isolated "meaning" for even the simplest of things, like a piece of stone. If there is difference and nuance to be found when it comes to a rock, what hope have we got when it comes to deep philosophy or the design of complex machines and software?

LLMs are not very good at this right now, but if they became a lot better at it, they would a) become more useful, and b) the work done to get them there would tell us a lot about human communication.

[0] https://en.wikipedia.org/wiki/Alfred_Adler

londons_explore 2 hours ago | parent | prev | next [-]

This is why we fed it the whole internet and every library as training data...

By now it should know this stuff.

jasongi 10 minutes ago | parent [-]

Future models know it now, assuming they suck in mastodon and/or hacker news.

Although I don't think they actually "know" it. This particular trick question will be in the bank just like the seahorse emoji or how many Rs in strawberry. Did they start reasoning and generalising better or did the publishing of the "trick" and the discourse around it paper over the gap?

I wonder if in the future we will trade these AI tells like 0days, keeping them secret so they don't get patched out at the next model update.

kitd 12 minutes ago | parent | prev | next [-]

Given that an estimated 70% of human communication is non-verbal, it's not so surprising though.

scott_w 43 minutes ago | parent | prev | next [-]

> You would be surprised, however, at how much detail humans also need to understand each other.

But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?

pickleRick243 38 minutes ago | parent [-]

But also, why would you ask whether you should walk or drive if the car is at home? Either way the answer is obvious, and there is no way to interpret it except as a trick question. Of course, the parsimonious assumption is that the car is at home, so assuming that the car is at the car wash is a questionable choice, to say the least (otherwise there would be two cars in the situation, which the question doesn't mention).

DharmaPolice 10 minutes ago | parent | next [-]

I think a good rule of thumb is to default to assuming a question is asked in good faith (i.e. it's not a trick question). That goes for human beings and chat/AI models.

In fact, it's particularly true for AI models because the question could have been generated by some kind of automated process. e.g. I write my schedule out and then ask the model to plan my day. The "go 50 metres to car wash" bit might just be a step in my day.

scott_w 15 minutes ago | parent | prev [-]

But you're ascribing understanding to the LLM, which is not what it's doing. If the LLM understood you, it would realise it's a trick question and, assuming it was British, reply with "You'd drive it because how else would you get it to the car wash you absolute tit."

Even the higher-level reasoning models, while answering the question correctly, don't grasp the higher context that the question is obviously a trick question. They still answer earnestly. Granted, it is a tool that is doing what you want (answering a question), but let's not ascribe higher understanding than what is clearly observed - and also based on what we know about how LLMs work.

j_maffe 2 hours ago | parent | prev | next [-]

Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer.

dxdm 2 hours ago | parent [-]

Wouldn't that be nice. I've been party and witness to enough misunderstandings to know that this is far from universally true, even for people like me who are more primed than average to spot missing context.

jiggawatts an hour ago | parent | prev [-]

I regularly tell new people at work to be extremely careful when making requests through the service desk — manned entirely by humans — because the experience is akin to making a wish from an evil genie.

You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)

E.g.: I may ask someone to submit a ticket to “extend my account expiry”.

They’ll submit: “Unlock Jiggawatts’ account”

The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.

That’s on a good day.

Last week they created Jiggawatts2.

The AIs have got to be better than this, surely!

I suspect they already are.

People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.

Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly thought it must have really meant.”

scott_w 42 minutes ago | parent [-]

There's a lot of overlap between the smartest bears and the dumbest humans. However, we would want our tools to be more useful than the dumbest humans...

nearbuy 2 hours ago | parent | prev | next [-]

I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue.

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

grey-area an hour ago | parent [-]

It’s just not intelligent or reasoning, and this sort of question exposes that more clearly.

Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning.

I don’t really understand why this is so hard to accept.

fauigerzigerk an hour ago | parent [-]

I agree completely. I'm tempted to call it a clear falsification of any "reasoning" claim that some of these models have in their name.

But I think it's possible that there is an early cost optimisation step that prevents a short and seemingly simple question even getting passed through to the system's reasoning machinery.
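Purely as a hypothetical sketch of what such a cost-optimisation shortcut could look like (the function names, the length threshold, and the routing idea itself are assumptions for illustration, not anything documented about real model architectures):

    # Hypothetical router: a cheap gate decides whether a prompt skips the
    # expensive reasoning pass. All names and thresholds are invented.
    def looks_simple(prompt: str, max_words: int = 25) -> bool:
        # Crude proxy for "simple": short and containing no code block.
        return len(prompt.split()) <= max_words and "```" not in prompt

    def fast_model(prompt: str) -> str:
        return f"[cheap, non-reasoning answer to: {prompt!r}]"

    def reasoning_model(prompt: str) -> str:
        return f"[deliberate, reasoning-pass answer to: {prompt!r}]"

    def answer(prompt: str) -> str:
        if looks_simple(prompt):
            # Short questions never reach the reasoning path.
            return fast_model(prompt)
        return reasoning_model(prompt)

    print(answer("My car is 50 metres from the car wash. Should I walk or drive?"))

A trick question like the car-wash one is exactly the kind of prompt such a gate would wave through as "simple".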

However, I haven't read anything on current model architectures suggesting that their so called "reasoning" is anything other than more elaborate pattern matching. So these errors would still happen but perhaps not quite as egregiously.

ssl-3 2 hours ago | parent | prev | next [-]

The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

necovek an hour ago | parent | next [-]

Humans ask each other silly questions all the time: a human confronted with a question like this would either blurt out a bad response like "walk" without thinking before realizing what they are suggesting, or pause and respond with "to get your car washed, you need to get it there, so you must drive".

Now, humans, other than not even thinking (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked if you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too.

(and if it's flat or downhill, you might even push the car for 50m ;))

shakna 2 hours ago | parent | prev | next [-]

But the number of outlandish requests in business logic is countless.

Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children.
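Roughly the shape of the thing, as a made-up sketch (the record structure, field names, and the bypass flag are all invented for illustration; any real system would be far messier):

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List, Optional

    @dataclass
    class Record:
        end_date: Optional[date] = None
        confirmed: bool = False
        children: List["Record"] = field(default_factory=list)

    def end_date_record(rec: Record, when: date, data_cleaning_bypass: bool = False) -> None:
        """End-date a record and cascade the end-date to its children.

        Normally a one-shot operation: a confirmed, already end-dated record
        cannot be end-dated again. With the data-cleaning bypass, that check
        is skipped -- and the cascade to children is skipped too.
        """
        if not data_cleaning_bypass:
            if rec.confirmed and rec.end_date is not None:
                raise ValueError("already end-dated and confirmed")
            rec.end_date = when
            for child in rec.children:
                end_date_record(child, when)
        else:
            # Bypass path: repeatable, and deliberately does not touch children.
            rec.end_date = when

The happy path is the top branch; the exceptions live in the bypass.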

There are more exceptions than there are rules, the moment you get any international pipeline involved.

ssl-3 2 hours ago | parent [-]

So, in human interaction: when the business logic goes wrong because it was described with a lack of specificity, who gets blamed for this?

shakna 23 minutes ago | parent | next [-]

I wasn't specific, because I'd rather not piss off my employer. But anyone who works in a similar space will recognise the pattern.

It's not underspecified. More... Overspecified. Because it needs to be. But AI will assume that "impossible" things never happen, and choose a happy path guaranteed to result in failure.

You have to build for bad data. Comes with any business of age. Comes with international transactions. Comes with human mistakes that just build up over the decades.

The apparent current state of a thing, is not representative of its history, and what it may or may not contain. And so you have nonsensical rules, that are aimed at catching the bad data, so you have a chance to transform it into good data when it gets used, without needing to mine the entire petabytes of historical data you have sitting around in advance.

necovek an hour ago | parent | prev [-]

Depends on what was missing.

If we used MacOS throughout the org, and we asked a SW dev team to build inventory tracking software without specifying the OS, I'd squarely put the blame on SW team for building it for Linux or Windows.

(Yes, it should be a blameless culture, but if an obvious assumption like this is broken, someone is intentionally messing with you most likely)

There exists an expected level of context knowledge that is frequently underspecified.

coldtea 2 hours ago | parent | prev | next [-]

>The question is so outlandish that it is something that nobody would ever ask another human

There is an endless variety of quizzes just like this that humans ask each other for fun, there is a whole lot of "trick questions" humans ask other humans to trip them up, and there are all kinds of seemingly normal questions with dumb assumptions, quite close to this one, that humans exchange.

jstummbillig 2 hours ago | parent | prev | next [-]

I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful.

Agentlien 2 hours ago | parent | prev [-]

I've used a few facetious comments in ChatGPT conversations. It invariably misses them and takes my words at face value. Even when prompted that there is sarcasm it has missed, it apologizes and is unable to figure out what it's missing.

I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.

anon_anon12 2 hours ago | parent | prev | next [-]

Exactly. Only if an AI is able to handle basics like this is it revolutionary.

BoredPositron 2 hours ago | parent | prev [-]

I would ask you to stop being a dumb ass if you asked me the question...

coldtea 2 hours ago | parent [-]

Only to be tripped up by the countless "hidden assumptions" questions, similar to this one, that humans regularly run into.