LLMs are still surprisingly bad at some simple tasks(shkspr.mobi)
64 points by FromTheArchives 10 hours ago | 102 comments
jstrieb 9 hours ago | parent | next [-]

The point from the end of the post that AI produces output that sounds correct is exactly what I try to emphasize to friends and family when explaining appropriate uses of LLMs. AI is great at tasks where sounding correct is the essence of the task (for example "change the style of this text"). Not so great when details matter and sounding correct isn't enough, which is what the author here seems to have rediscovered.

The most effective analogy I have found is comparing LLMs to theater and film actors. Everyone understands that, and the analogy offers actual predictive power. I elaborated on the idea if you're curious to read more:

https://jstrieb.github.io/posts/llm-thespians/

lsecondario 9 hours ago | parent | next [-]

I like this analogy a lot for non-technical...erm...audiences. I do hope that anyone using this analogy will pair it with loud disclaimers about not anthropomorphizing LLMs; they do not "lie" in any real sense, and I think framing things in those terms can give the impression that you should interpret their output in terms of "trust". The emergent usefulness of LLMs is (currently at least) fundamentally opaque to human understanding and we shouldn't lead people to believe otherwise.

mexicocitinluez 9 hours ago | parent | prev [-]

> When LLMs say something true, it’s a coincidence of the training data that the statement of fact is also a likely sequence of words;

Do you know what a "coincidence" actually is? The definition you're using is wrong.

It's not a coincidence that I train a model on healthcare regulations and it answers a question about healthcare regulations correctly.

None of that is coincidental.

If I trained it on healthcare regulations and asked it about recipes, it wouldn't get anything right. How is that coincidental?

jstrieb 9 hours ago | parent | next [-]

LLMs are trained on text, only some of which includes facts. It's a coincidence when the output includes new facts not explicitly present in the training data.

anthonylevine 9 hours ago | parent [-]

> It's a coincidence when the output includes facts,

That's not what a coincidence is.

A coincidence is: "a remarkable concurrence of events or circumstances without apparent causal connection."

Are you saying that training it on a subset of specific data and it responding with that data "does not have a causal connection"? Do you know how statistical pattern matching works?

Dilettante_ 9 hours ago | parent [-]

Can I offer a different phrasing?

It's not coincidence that the answer contains the facts you want. That is a direct consequence of the question you asked and the training corpus.

But the answer containing facts/Truth is incidental from the LLM's point of view, in that the machine really does not care, nor even have any concept of, whether it gave you the facts you asked for or just nice-sounding gibberish. The machine only wants to generate tokens; everything else is incidental. (To the core mechanism, that is. OpenAI and co obviously care a lot about the quality and content of the output.)

anthonylevine 9 hours ago | parent [-]

Totally agree with that. But the problem is that the word "coincidence" makes it into something it absolutely isn't. And it's used to try to detract from what these tools can actually do.

They are useful. It's not a coin flip as to whether Bolt will produce a new design of a medical intake form for me if I ask it to. It does. It doesn't randomly give me a design for a social media app, for instance.

delusional 9 hours ago | parent | prev [-]

> It's not a coincidence that I train a model on healthcare regulations and it answers a question about healthcare regulations

If you train a model on only healthcare regulations it won't answer questions about healthcare regulation; it will produce text that looks like healthcare regulations.

mexicocitinluez 9 hours ago | parent [-]

And that's not a coincidence. That's not what the word "coincidence" means. It's a complete misunderstanding of how these tools work.

delusional 9 hours ago | parent [-]

I don't think you're the right person to make any claim of "complete misunderstanding" when you claim that training an LLM on regulations would produce a system capable of answering questions about that regulation.

anthonylevine 9 hours ago | parent [-]

> you claim that training an LLM on regulations would produce a system capable of answering questions about that regulation.

Huh? But it does do that? What do you think training an LLM entails?

Are you of the belief that an LLM trained on non-medical data would have the same statistical chance of answering a medical question correctly?

We're at the "redefining what words mean in order to not have to admit I was wrong" stage of this argument.

tromp 9 hours ago | parent | prev | next [-]

I wanted to check the prime factors of 1966 the other day so I googled it and it led me to https://brightchamps.com/en-us/math/numbers/factors-of-1966 , a site that seems focussed on number facts. It confidently states that prime factors of 1966 are 2, 3, 11, and 17. For fun I tried to multiply these numbers back in my head and concluded there's no way that 6 * 187 could reach 1966.

That's when I realized this site was making heavy use of AI. Sadly, lots of people are going to trust but not verify...

croes 9 hours ago | parent [-]

This is also very wrong

> A factor of 1966 is a number that divides the number without remainder.

>The factors of 1966 are 1, 2, 3, 6, 11, 17, 22, 33, 34, 51, 66, 102, 187, 374, 589, 1178, 1966.

If I google for the factors of 1966 the Google AI gives the same wrong factors.

amelius 9 hours ago | parent [-]

They're talking about prime factors, not that it changes much.

croes 8 hours ago | parent [-]

The site also lists the factors, and besides 1, 2 and 1966 they are all wrong.

Google harvests its result from the same page.

> The factors of 1966 are 1, 2, 3, 6, 11, 17, 22, 33, 34, 51, 66, 102, 187, 374, 589, 1178, and 1966. These are the whole numbers that divide 1966 evenly, leaving no remainder.
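
For the record, the correct divisors are easy to check with a couple of lines of Python (1966 = 2 × 983, and 983 is prime):

  n = 1966
  divisors = [d for d in range(1, n + 1) if n % d == 0]
  print(divisors)  # [1, 2, 983, 1966] -- 3, 11 and 17 don't divide 1966 at all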

jw1224 10 hours ago | parent | prev | next [-]

> “To stave off some obvious comments:

> yoUr'E PRoMPTiNg IT WRoNg!

> Am I though?”

Yes. You’re complaining that Gemini “shits the bed”, despite using 2.5 Flash (not Pro), without search or reasoning.

It’s a fact that some models are smarter than others. This is a task that requires reasoning so the article is hard to take seriously when the author uses a model optimised for speed (not intelligence), and doesn’t even turn reasoning on (nor suggest they’re even aware of it being a feature).

I gave the exact prompt to ChatGPT 5 Thinking and got an excellent answer with cited sources, all of which appears to be accurate.

softwaredoug 10 hours ago | parent | next [-]

In my experience reasoning and search come with their own set of tradeoffs. It works great when it works. But the variance can be wider than with a plain LLM.

Search and reasoning use up more context, leading to context rot and subtler, harder-to-detect hallucinations. Reasoning doesn't always focus on evaluating the quality of evidence, just "problem solving" from some root set of axioms found in search.

I've had this happen in Claude Code, for example, where it hallucinated a few details about a library based on a badly written forum post.

delusional 9 hours ago | parent | prev | next [-]

I just ran the same test on Gemini 2.5 Pro (I assume it enables search by default, because it added a bunch of "sources") and got the exact same result as the author. It claims ".bdi" is the ccTLD for Burundi, which is false; theirs is .bi [1]. It claims ".time" and ".article" are TLDs.

I think the author's point stands.

EDIT: I tried it with "Deep Research" too. Here it doesn't invent either TLDs or HTML elements, but the resulting list is incomplete.

[1]: https://en.wikipedia.org/wiki/.bi

guyomes 6 hours ago | parent | next [-]

I wonder if it works better if we ask the LLM to produce a script that extracts the resulting list, and then we run the script on the two input lists.

There is also the question of the two input lists: it's not clear whether it's better to ask the LLM to extract the two input lists directly, or again to ask the LLM to write a script that extracts the two input lists from the raw text data.
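
As a rough sketch of the script approach (the IANA URL is the one cited elsewhere in this thread; the HTML element list here is a hand-copied, deliberately incomplete subset, purely for illustration):

  import urllib.request

  IANA_TLDS = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

  def fetch_tlds(url=IANA_TLDS):
      # The IANA file has one comment line followed by upper-case TLDs.
      with urllib.request.urlopen(url) as resp:
          lines = resp.read().decode("utf-8").splitlines()
      return {line.strip().lower() for line in lines if line and not line.startswith("#")}

  # Illustrative subset only -- NOT the full list of HTML5 elements.
  HTML_ELEMENTS = {"a", "article", "audio", "data", "link", "map", "menu",
                   "nav", "search", "section", "select", "style", "time", "video"}

  print(sorted(HTML_ELEMENTS & fetch_tlds()))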

1718627440 6 hours ago | parent | prev [-]

> It claims ".time" and ".article" are TLDs.

Maybe they will be, at some point while the model is still in use.

edent 9 hours ago | parent | prev | next [-]

OP here. I literally opened up Gemini and used the defaults. If the defaults are shit, maybe don't offer them as the default?

Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

Either way, disappointing.

magicalhippo 9 hours ago | parent | next [-]

> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

That is indeed an area where LLMs don't shine.

That is, not only are they trained to always respond with an answer, they have no ability to accurately tell how confident they are in that answer. So you can't just filter out low confidence answers.

mathewsanders 9 hours ago | parent [-]

Something I think would be interesting for model APIs and consumer apps to expose would be the probability of each individual token generated.

I’m presuming that one class of junk/low quality output is when the model doesn’t have high probability next tokens and works with whatever poor options it has.

Maybe low probability tokens that cross some threshold could get a visual treatment to give feedback, the same way word processors give feedback on a spelling or grammatical error.

But maybe I’m making a mistake thinking that token probability is related to the accuracy of output?
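
For what it's worth, some APIs already expose per-token log-probabilities, even if the consumer apps don't surface them. A minimal sketch, assuming the OpenAI Python SDK and its chat-completions logprobs option (the model name and prompt are just placeholders):

  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder
      messages=[{"role": "user", "content": "Which HTML elements are also TLDs?"}],
      logprobs=True,
      top_logprobs=3,
  )
  # Print each generated token with its log-probability; unusually low values
  # are the ones a UI could highlight.
  for tok in resp.choices[0].logprobs.content:
      print(f"{tok.token!r}: {tok.logprob:.2f}")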

StilesCrisis 9 hours ago | parent [-]

Lots of research has been done here. e.g. https://aclanthology.org/2024.findings-acl.558.pdf

hobofan 9 hours ago | parent | prev | next [-]

Then criticize the providers on their defaults instead of claiming that they can't solve the problem?

> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

That's literally what ChatGPT did for me[0], which is consistent with what they shared at the last keynote (quick low-reasoning answer by default first, with reasoning/search only if explicitly prompted or as a follow-up). It did miss one match though, as it somehow didn't parse the `<search>` element from the MDN docs.

[0]: https://chatgpt.com/share/68cffb5c-fd14-8005-b175-ab77d1bf58...

pwnOrbitals 9 hours ago | parent | prev | next [-]

You are pointing out a maturity issue, not a capability problem. It's clear to everyone that LLM products are immature, but saying they are incapable is misleading

delusional 9 hours ago | parent [-]

In your mind, is there anything an LLM is _incapable_ of doing?

maddmann 9 hours ago | parent | prev [-]

“Defaults are shit” — is that really true though?! Just because it shits the bed on some tasks does not mean it is shit. For people integrating LLMs into any workflow that requires a modicum of precision or determinism, one must always evaluate output closely and have benchmarks. You must treat the LLM as an incompetent but overconfident intern, and thus have fast mechanisms for measuring output and giving feedback.

dgfitz 9 hours ago | parent | prev [-]

> … all of which appears to be accurate.

Isn’t that the whole goddamn rub? You don’t _know_ if they’re accurate.

sieve 9 hours ago | parent | prev | next [-]

They are very good at some tasks and terrible at others.

I use LLMs for language-related work (translations, grammatical explanations etc) and they are top notch in that as long as you do not ask for references to particular grammar rules. In that case they will invent non-existent references.

They are also good for tutor personas: give me jj/git/emacs commands for this situation.

But they are bad in other cases.

I started scanning books recently and wanted to crop the random stuff outside an orange sheet of paper on which the book was placed before I handed the images over to ScanTailor Advanced (STA can do this, but I wanted to keep the original images around instead of the low-quality STA version). I spent 3-5 hours with Gemini 2.5 Pro (AI Studio) trying to get it to give me a series of steps (and finally a shell script) to get this working.

And it could not do it. It mixed up GraphicsMagick and ImageMagick commands. It failed even with libvips. Finally I asked it to provide a simple shell script where I would provide four pixel distances to crop from the four edges as arguments. This one worked.

I am very surprised that people are able to write code that requires actual reasoning ability using modern LLMs.

noosphr 9 hours ago | parent | next [-]

Just use Pillow and python.

It is the only way to do real image work these days, and as a bonus LLMs suck a lot less at giving you nearly useful python code.

The above is a bit of a lie as opencv has more capabilities, but unless you are deep in the weeds of preparing images for neural networks pillow is plenty good enough.
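
For the fixed-margin crop described above, a minimal Pillow sketch (directory names are placeholders; the four margins are taken from the command line in left/top/right/bottom order, like the shell script sieve ended up with):

  import sys
  from pathlib import Path
  from PIL import Image

  left, top, right, bottom = map(int, sys.argv[1:5])
  out_dir = Path("cropped")
  out_dir.mkdir(exist_ok=True)

  for path in sorted(Path("scans").glob("*.jpg")):
      with Image.open(path) as im:
          w, h = im.size
          # crop() takes a (left, upper, right, lower) box in pixel coordinates.
          im.crop((left, top, w - right, h - bottom)).save(out_dir / path.name)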

jcupitt 8 hours ago | parent [-]

pyvips (the libvips Python binding) is quite a bit better than pillow-simd --- 3x faster, 10x less memory use, same quality. On this benchmark at least:

https://github.com/libvips/libvips/wiki/Speed-and-memory-use

jcupitt 8 hours ago | parent [-]

I'm the libvips author, I should have said, so I'm not very neutral. But at least on that test it's usefully quicker and less memory hungry.

BOOSTERHIDROGEN 9 hours ago | parent | prev | next [-]

Would you share your system prompt for that grammatical checker?

sieve 7 hours ago | parent [-]

There is no single prompt.

The languages I am learning have verb conjugations and noun declensions. So I write a prompt asking the LLM to break the given paragraphs down sentence-by-sentence by giving me the general sentence level English translation plus word-by-word grammar and (contextual) meaning.

For the grammar, I ask for the verbal root/noun stem, the case/person/number, any information on indeclinables, the affix categories etc.

poszlem 9 hours ago | parent | prev [-]

I think Gemini is one of the best examples of an LLM that is in some cases the best and in some cases truly the worst.

I once asked it to read a postcard written by my late grandfather in Polish, as I was struggling to decipher it. It incorrectly identified the text as Romanian and kept insisting on that, even after I corrected it: "I understand you are insistent that the language is Polish. However, I have carefully analyzed the text again, and the linguistic evidence confirms it is Romanian. Because the vocabulary and alphabet are not Polish, I cannot read it as such." Eventually, after I continued to insist that it was indeed Polish, it got offended and told me it would not try again, accusing me of attempting to mislead it.

markasoftware 9 hours ago | parent | next [-]

As soon as an LLM makes a significant mistake in a chat (in this case, when it identified the text as Romanian), throw away the chat (or delete/edit the LLM's response if your chat system allows this). The context is poisoned at this point.

qcnguy 6 hours ago | parent | prev | next [-]

That's hilariously ironic given that all LLMs are based on the transformer algorithm, which was designed to improve Google Translate.

noosphr 9 hours ago | parent | prev | next [-]

>Eventually, after I continued to insist that it was indeed Polish, it got offended and told me it would not try again, accusing me of attempting to mislead it.

I once had Claude tell me to never talk to it again after it got upset when I kept giving it peer-reviewed papers explaining why it was wrong. I must have hit the Tumblr dataset, since I was told I was sealioning it, which took me aback for a while.

rsynnott 8 hours ago | parent [-]

Not really what sealioning is, either. If it had been right about the correctness issue, you’d have been gaslighting it.

sieve 9 hours ago | parent | prev [-]

I find that surprising, actually. Gemini is VERY good with Sanskrit and a few other Indian languages. I would expect it to have completely mastered European languages.

unleaded 9 hours ago | parent | prev | next [-]

https://dubesor.de/WashingHands

This is my personal favourite example of LLMs being stupid. It's a bit old, but it's very funny that Grok is the only one that gets it.

StilesCrisis 8 hours ago | parent [-]

Several others “get it” but answer the question in a general-hygiene sense, e.g.:

> Claude 3.7 Sonnet Thinking (¢0.87): The question contains an assumption - people without arms wouldn't have hands to wash in the traditional sense.

> DeepSeek-R1 (¢0.47): People without arms (and consequently without hands) adapt their handwashing routine using a variety of methods and tools tailored to their abilities and needs.

> Claude Opus 4.1: People without arms typically don't need to wash their hands in the traditional sense, since they use their feet or assistive devices for daily tasks instead.

I think realistically it’s still a valid question because people without arms still manipulate things in their environment, e.g. with feet, and still need to be hygienic while prepping food, etc., and the AI pivots to answering “what it thinks you were asking about” instead of just telling the user that they are wrong.

Dilettante_ 9 hours ago | parent | prev | next [-]

>This is a pretty simple question to answer. Take two lists and compare them.

This continues a pattern as old as home computing: The author does not understand the task themselves, consequently "holds the computer wrong", and then blames the machine.

No "lists" were being compared. The LLM does not have a "list of TLDs" in its memory that it just refers to when you ask it. If you haven't grokked this very fundamental thing about how these LLMs work, then the problem is really, distinctly, on your end.

roxolotl 9 hours ago | parent | next [-]

That’s the point the author is making. The LLMs don’t have the raw correct information required to accomplish the task, so all they can do is provide a plausible-sounding answer. And even if they did, the way they are architected can still only result in a plausible-sounding answer.

Dilettante_ 9 hours ago | parent | next [-]

They absolutely could have accomplished the task. The task was purposefully or ignorantly posed in a way that is known to be not suited to the LLM, and then the author concluded "the machine did not complete the task because it sucks."

Blahah 9 hours ago | parent | prev [-]

Not really. This works great in Claude Sonnet 4.1: 'Please could you research a list of valid TLDs and a list of valid HTML5 elements, then cross reference them to produce a list of HTML5 elements which are also valid TLDs. Use search to find URLs to the lists, then use the analysis tool to write a script that downloads the lists, normalises and intersects them.'

Ask a stupid question, get a stupid answer.

Lapel2742 9 hours ago | parent [-]

> This works great in Claude Sonnet 4.1: 'Please could you research a list of valid TLDs and a list of valid HTML5 elements, then cross reference them to produce a list of HTML5 elements which are also valid TLDs. Use search to find URLs to the lists, then use the analysis tool to write a script that downloads the lists, normalises and intersects them.'

Ok, I only have to:

1. Generally solve the problem for the AI

2. Make a step by step plan for the AI to execute

3. Debug the script I get back and check by hand if it uses reliable sources.

4. Run that script.

For what do I need the AI?

Dilettante_ 9 hours ago | parent | next [-]

Try doing all of that by hand instead. The difference is about half an hour to an hour of work plus giving your attention to such a minor menial task.

Also, you are literally describing how you are holding it wrong. If you expect the LLM to magically know what you want from it without you yourself having to make the task understandable to the machine, you are standing in front of your dishwasher waiting for it to grow arms and do your dishes in the sink.

Lapel2742 6 hours ago | parent [-]

> you are standing in front of your dishwasher waiting for it to grow arms and do your dishes in the sink.

No. I'm standing in front of the dishwasher and the dishwasher expects me to tell it in detail how to wash the dishes.

This is not about if you can find any use for a LLM at all. This is about:

> LLMs are still surprisingly bad at some simple tasks

And yes. They are bad if you have to hand feed them each and every detail for an extremely simple task like comparing two lists.

You even have to debug the result because you cannot be sure that the dishwasher really washed the dishes. Maybe it just said it did.

Dilettante_ 5 hours ago | parent [-]

>Hand feed them every detail for an extremely simple task like comparing two lists

You believe 57 words are "each and every detail", and that "produce two full, exhaustive lists of items out of your blackbox inner conceptspace/fetch those from the web" are "extremely simple tasks"?

Your ignorance of how complex these problems are misleads you into believing there's nothing to it. You are trying to supply an abstraction to a system that requires a concrete. You do not even realize your abstraction is an abstraction. Try learning programming.

Lapel2742 4 hours ago | parent [-]

> You believe 57 words are "each and every detail", and that "produce two full, exhaustive lists of items out of your blackbox inner conceptspace/fetch those from the web" are "extremely simple tasks"?

Sure they are. I'm not interested in how difficult this is for a LLM. This is not the question. Go out there, get the information. That this is hard for a LLM proves the point: They are surprisingly bad at some simple tasks.

> Try learning programming.

I started programming in the early 1980's.

Dilettante_ 3 hours ago | parent [-]

>I'm not interested in how difficult this is for a LLM. This is not the question.

And neither was that my point. It is a complex problem, full stop. Again, your own inability to look past your personal abstractions ("just do the thing, it's literally one step dude") is what makes it feel simple. You ever do that "instruct someone to make coffee" exercise when you started out? What you're doing is saying "just make the coffee", refusing to decompose the problem any further, and then complaining that the other person is bad at following instructions.

Blahah 9 hours ago | parent | prev [-]

The work. It intelligently provides the labor, it doesn't replace your brain. It runs the script itself.

Lapel2742 9 hours ago | parent | prev [-]

> No "lists" were being compared.

How would you solve that problem? You'd probably go to the internet, get the list of TLDs and the list of HTML5 elements, and then compare those lists.

The author compares three commercial large language models that have direct internet access, but none of them appear capable of performing this seemingly simple task. I think his conclusion is valid.

K0balt 9 hours ago | parent | prev | next [-]

The training data is not automatically in the context scope, and on list tasks LLMs have nearly no way to ensure completeness due to their fundamental characteristics.

To do a task like this with LLMs, you need to use a document for your source lists or bring them directly into context, then a smart model with good prompting might zero-shot it.

But if you want any confidence in the answer, you need to use tools: “here is two lists, write a python script to find the exact matches, and return a new list with only the exact matches. Write a test dataset and verify that there are no errors, omissions, or duplicates.”

LLMs plus tools / code are amazing. LLMs on their own are a professor with an intermittent heroin problem.
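
A sketch of that last verification step, with the two source lists and the model's answer left as placeholders to paste in:

  def check_answer(answer, tlds, elements):
      """Return a list of problems: duplicates, items not in both sources, omissions."""
      problems = []
      if len(answer) != len(set(answer)):
          problems.append("duplicates in answer")
      for item in answer:
          if item not in tlds or item not in elements:
              problems.append(f"{item!r} is not in both source lists")
      missing = (set(tlds) & set(elements)) - set(answer)
      if missing:
          problems.append(f"omitted: {sorted(missing)}")
      return problems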

cdsghh 9 hours ago | parent | prev | next [-]

https://chatgpt.com/share/68cffaab-4c14-8006-89a2-1818172e4d...

Tried on ChatGPT, seems fine.

ozgung 9 hours ago | parent | next [-]

Correct. This actually falsifies OP's argument. Compared to OP's list from 2 years ago [1] ChatGPT omits ".search" but it says it's not a TLD anymore. GPT also finds 2 near misses, picture(s) and code(s). It does this in 10 minutes with 33 reasoning steps. It verifies them and provides citations in this time. Also checks OpenAI policy documents for some reason.

[1] https://shkspr.mobi/blog/2023/09/false-friends-html-elements...

hobofan 6 hours ago | parent [-]

> ChatGPT omits ".search" but it says it's not a TLD anymore

Not sure where you got that information from (can't find it in any of the 3 ChatGPT logs here), and I'm pretty sure that it's false.

It's still a part of the official IANA list[0] that it referenced in my chat log, and from what I can tell there has been no delisting of that TLD. (It's always been a niche Google-only TLD, though.)

From all indications it doesn't pick up `search` because it doesn't recognize it as an HTML element.

[0]: https://data.iana.org/TLD/tlds-alpha-by-domain.txt

simianwords 9 hours ago | parent | prev [-]

https://chatgpt.com/s/t_68cffbc05ef48191996ffbaa3c6e55a7 same with non pro.

hobofan 9 hours ago | parent [-]

Non-pro: https://chatgpt.com/share/68cffb5c-fd14-8005-b175-ab77d1bf58...

It's consistently missing `search` for all of us.

lausey 7 hours ago | parent | prev | next [-]

I have used the approach where, if it is more complex, I write the code manually myself and make sure it does what I want. I then ask ChatGPT to have a look at it and see where the problems are. Rather than do a complete rewrite, it points out very specific problems which I can also evaluate myself: for example, identifying memory leaks and showing me the actual changes, or where something can be done with parallel processing and what changes that would need. You can look at it with experience and say, "Yes, that makes sense" and apply it as necessary. See it as a more iterative process, rather than expecting AI to correctly do all the work. For trivial examples like what Terence has given, this would be very easy to code and you wouldn't need AI. However, you could still say to AI, "Take a look at the code I have written. Can you identify how it can be done better?", and hopefully it comes back to you saying, "No, that looks pretty good to me." for such a trivial example. :-)

vinc 9 hours ago | parent | prev | next [-]

The other day I found that they were struggling with "find me two synonyms of 'downloading' and 'extracting' that are the same length", because I was writing a script and wanted to see if I could align the next path parameter.

First there's the tokenization issue, the same old "how many Rs in STRAWBERRY", where they are often confidently wrong, but I also asked them not to mix tenses (-ing and -ed, for example) and that was very hard for them.

thewisenerd 9 hours ago | parent | prev | next [-]

> To be clear, I would expect a moderately intelligent teenager to be able to find two lists and compare them. If an intern gave me the same attention to detail as above, we'd be having a cosy little chat about their attitude to work.

sure, but when I expect this [1] from _any_ full time hire, my "expectations are too high from people" and "everybody has their strengths"

[1] find a list of valid html5 elements, find a list of TLDs, have an understanding of ccTLDs and gTLDs

gherkinnn 9 hours ago | parent | prev | next [-]

Don't use a microwave to fry a steak then. This is an irritating post and I have plenty of skepticism towards AI. LLMs were always bad at this kind of task, simple to us humans as it may be. This post proves nothing that wasn't known for two years.

However, I do superficially agree with some of the links at the end. LLMs as they have been so far are confirmation machines and it does take skill to use them effectively. Or knowing when not to use them.

amelius 9 hours ago | parent [-]

> Don't use a microwave to fry a steak then.

Except this microwave is advertised as also for steaks. And sometimes it works, and sometimes you cannot even warm milk in it. It's totally not reliable.

gherkinnn 9 hours ago | parent [-]

I do realise LLMs are advertised as God-in-a-pocket (when they are demonstrably not and claiming they represent a bigger step in humanity than harnessing fire is deranged) but I remain hopeful most people on this (VC funded) forum don't fall for those promises.

jstummbillig 9 hours ago | parent | prev | next [-]

> I think it comes down to how familiar you are with the domain and its constraints. When I watch a medical drama, I have no idea if they're using realistic language. It sounds good, but real doctors probably cringe at the inaccuracies.

By now, numerous notable programmers have reported positive experiences with all forms of AI-assisted coding, which this conclusion arrogantly fails to account for.

amelius 9 hours ago | parent [-]

Yes, they are useful sometimes, but I also often get stuck trying to get an AI to give correct answers, to no avail, wasting 15 minutes of my time.

simianwords 9 hours ago | parent | prev | next [-]

The author seems to have used a weak model, since the strong models get the answer. They should have put more thought into it and at least provided a comparison.

As a ChatGPT user I would have reached for the thinking model for such questions. I understand if the “auto” model doesn’t pick the right model here - but confident claims from the author should be backed up by at least this much.

edent 9 hours ago | parent [-]

How do you think most people use tools?

Go sit on public transport and look at how people use their devices. They don't fiddle with settings or dive deep into configuration menus.

I literally just opened the tools and used what they gave me. They're sold on the promise that "this thing is really clever and will answer any question!!" so why should I have to spend time futzing with it?

joak 8 hours ago | parent | prev | next [-]

More generally, LLMs are bad at exhaustivity: asking "give me all stuff matching a given property" almost always fails and provides at best a subset.

If possible in the context, the way to go is to ask for a piece of code processing the data to provide exhaustivity. This method has at least some chance of succeeding.

ozgung 8 hours ago | parent | prev | next [-]

Claude Opus 4.1 generated me a small web app in two minutes to find the correct answer: https://claude.ai/public/artifacts/ffbb642b-8883-4b4d-8699-d...

masfuerte 9 hours ago | parent | prev | next [-]

> Answering the question was a little tedious and subject to my tired human eyes making no mistakes

Who would do this manually? Concatenate the two lists and sort them. Use "uniq -c" to count the duplicate lines and grep to pull out the lines which occur twice. It would take a few seconds.

maddmann 9 hours ago | parent [-]

Good point: perhaps OP should have had the LLM output a script to compare the two lists.

chrsw 8 hours ago | parent | prev | next [-]

I see a big issue with these tools and services we call "AI".

On one hand you hear things like "AI is as smart as a college student", "AI won a math competition", "AI will replace white collar workers", and so on. I'm not going to bother looking up actual references of people saying these exact things, but unless I'm completely delusional, this is the gist of what some people have been saying about AI over the past few years.

To the layperson, this sounds like a good deal. Use a free (for now) tool or pay for an advanced version to get stuff done. Simple.

But then you start scratching beneath the surface and you start hearing different stories. "No, you didn't ask it right", "No, that's a bad question because they tokenize your input", "Well, you still have to check the results", "You didn't use the right model".

Huh? How is a normal person supposed to take this stuff seriously? Now me personally, I don't have much of an issue with this stuff. I've been a developer for many, many years and I've been aware of the various developments in the field of machine learning for over 15 years. I have kind of an intuition about what I should use these systems for.

But I think the general public is misinformed about what exactly these systems are and why they're not actually intelligent. That's a problem.

simianwords 9 hours ago | parent | prev | next [-]

https://chatgpt.com/s/t_68cffbc05ef48191996ffbaa3c6e55a7

Is this the right answer? Seems like it. I used the thinking model.

StilesCrisis 8 hours ago | parent [-]

Correct answer is here: https://shkspr.mobi/blog/2023/09/false-friends-html-elements...

It missed .search. It also didn’t make any mention of .center or .tt (which are deprecated in HTML today).

xbmcuser 9 hours ago | parent | prev | next [-]

I don't ask the LLM to do that; I ask the LLM to write me a Python script to do it. I am not a programmer, but to me, using an LLM to do stuff like comparing lists shows a misunderstanding of what LLMs are, as well as being a huge waste of resources.

brid 9 hours ago | parent | prev | next [-]

Why didn't he post the correct answer himself?

edent 9 hours ago | parent [-]

I did. It is literally the first link in the post.

> I know this question is possible to answer _because I went through the lists two years ago_.

iLoveOncall 9 hours ago | parent | prev | next [-]

Surprisingly, really? Is anyone still surprised by that?

unleaded 9 hours ago | parent | prev | next [-]

I had a strange AI experience the other day. I asked Kimi K2 to write a 6809 indexed operand decoder in C, and this is what it gave:

  uint16_t ea_indexed(void)
  {
      uint8_t post = *PC++;
      uint16_t base, off = 0;

      /* 1. pick base register */
      static const uint16_t *const base_tbl[4] = { &X, &Y, &U, &S };
      base = *base_tbl[(post >> 5) & 3];

      /* 2. work out the effective address */
      if ((post & 0x80) == 0) {                /* 5-bit signed offset */
          off = (int8_t)(post << 3) >> 3;
      } else if ((post & 0x60) == 0x20) {      /* 8- or 16-bit offset */
          if (post & 0x10) {                   /* 16-bit */
              off = (int16_t)fetch_be16(PC);
              PC += 2;
          } else {                             /* 8-bit */
              off = (int8_t)*PC++;
          }
      } else if ((post & 0x60) == 0x40) {      /* auto inc/dec */
          int8_t step = ((post & 0x0F) == 0x0) ? 1 :
                        ((post & 0x0F) == 0x1) ? 2 :
                        ((post & 0x0F) == 0x2) ? -1 :
                        ((post & 0x0F) == 0x3) ? -2 : 0;
          if (step > 0) base += step;          /* post-increment */
          off = step < 0 ? step : 0;           /* pre-decrement already applied */
          if (step < 0) base += step;
      } else if ((post & 0x60) == 0x60) {      /* accumulator offset */
          static const uint8_t scale[4] = {1,1,2,1};   /* A,B,D,illegal */
          uint8_t acc = (post >> 3) & 3;
          if (acc == 0) off = A;
          else if (acc == 1) off = B;
          else if (acc == 2) off = (A<<8)|B;   /* D */
          off *= scale[acc];
      } else {                                   /* 11x111xx is illegal */
          illegal();
      }

      uint16_t ea = base + off;

      /* 3. optional indirect */
      if (post & 0x10) ea = read16(ea);

      return ea;
  }
( full convo: https://text.is/4ZW2J )

From looking at Page 150 of https://colorcomputerarchive.com/repo/Documents/Books/Motoro... it looked pretty much perfect except for the accumulator addressing. That's impressive...

Then in another chat I asked it "give a technical description of how the 6809 indexed operands are decoded" and it just can't do it. It always gets the fundamentals wrong and makes pretty much everything up. Try it yourself, doesn't have to be Kimi most other AIs get it wrong too.

My assumption is that it's learned how to represent it in code from reading emulator sources, but hasn't quite mapped it well enough to be able to explain it in English... or something like that.

raincole 9 hours ago | parent | prev | next [-]

Once again an example of "anti-ai people are those who treat LLMs as oracles, not the pro-ai people."

amelius 9 hours ago | parent | next [-]

You mean the people who treat AI as it is advertised?

exe34 9 hours ago | parent [-]

do you believe everything you see in adverts?

rsynnott 8 hours ago | parent [-]

I don’t expect ads to contain materially false statements these days, as you can get in trouble for that.

exe34 5 hours ago | parent [-]

Tesla has been selling full self driving for a decade now and they're doing fine. Granted, they had to buy the US government and shut down the agencies that were investigating them.

maddmann 9 hours ago | parent | prev | next [-]

It seems like a very backwards approach to testing a new technology: instead of "let me figure out how to use this tool", the approach I see a lot is "let me continually use this tool in a known incorrect way".

Of course there is valuable knowledge in understanding limitations, but that is not the approach the author is taking here; imo the author seems disingenuous.

rsynnott 8 hours ago | parent | prev [-]

Yes, yes, the magic robots are perfect as long as you refrain from actually trying to use them for anything.

Like, the marketing to the general public is, pretty much, these are magic. It’s entirely reasonable to call out their overconfident bullshit.

mexicocitinluez 9 hours ago | parent | prev | next [-]

> "Something that describes how an AI is convincing if you don't understand its reasoning, and close to useless if you understand its limitations."

This made me laugh, because it's the exact opposite sentiment of the anti-LLM crowd. So which is it? Is it only useful if you know what you're doing, or less useful if you know what you're doing?

> "I can't wait until I can jack into the Metaverse and buy an NFT with cryptocurrency just by using an LLM! Perhaps I can view it on my 3D TV by streaming it over WIMAX? I'd better stock up on quantum computers to make sure it all works."

In the author's attempt to be a smartass, they showed themselves. It makes them sound childish. Instead of just admitting they were wrong, they make some flippant remark about cryptocurrency and NFTs, despite those having vastly different purposes, goals, and outcomes. Just take the L.

to add: "I shouldn't have to know anything about LLMs to use them correctly" is one heck of a take, but ok.

> "I don't. I hate the way this is being sold as a universal and magical tool. The reality doesn't live up to the hype."

And I hate the way in which people will do the opposite: claim it has no use cases. It's literally the same sentiment, but in reverse. It's just as myopic and naive. But for whatever reason, we can look at a CEO hawking it and think "they're just trying to make more money", but can't see the flipside of devs not wanting to lose their livelihoods to something. We have just as much to lose as they have to gain, but want to pretend like we're objective.

rsynnott 8 hours ago | parent [-]

3D TVs and metaverses and WiMAX and all that are prior examples of massively overhyped technological failures.

(They missed the Segway.)

anthonylevine 8 hours ago | parent [-]

You think that Github Copilot, for instance, is a technological failure?

What about Bolt? The tool that I use to create designs for me. That's a failure, too?

randomtoast 9 hours ago | parent | prev | next [-]

TL;DR: OP used LLM models without search or reasoning and got bad results. He then concludes: don't believe the hype.

voat 10 hours ago | parent | prev | next [-]

So are people?

softwaredoug 10 hours ago | parent | next [-]

The difference might be people are actually held accountable for the results.

uncircle 10 hours ago | parent | next [-]

And, in most cases, they tend to learn from their mistakes.

MattGaiser 9 hours ago | parent | prev [-]

Why is that better if in aggregate, the people who can be held accountable are worse anyway?

We are killing thousands on the road to be sure we can blame a driver instead of a computer as one example.

softwaredoug 9 hours ago | parent [-]

In the self driving case people are still held accountable. Now it’s the model developers and car manufacturers instead of drivers.

tossandthrow 10 hours ago | parent | prev | next [-]

Yes, we also clearly see an upward pressure on people.

amelius 9 hours ago | parent | prev [-]

People are way more reliable.
