YuukiRey 6 days ago

I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.

I say capture logs without overriding console methods -> they override console methods.
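
(For reference, what I had in mind is roughly the following. It's only a sketch and assumes a Playwright-driven browser test, which wasn't part of my original prompt; the point is that subscribing to the console event captures the output while leaving console.log and friends untouched.)

    import { chromium } from 'playwright';

    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Collect everything the page logs without replacing console.* methods:
    // the 'console' event fires for each console.log/warn/error in the page.
    const logs: string[] = [];
    page.on('console', (msg) => {
      logs.push(`[${msg.type()}] ${msg.text()}`);
    });

    await page.goto('https://example.com');
    // ...exercise the page, then assert on or persist the collected logs.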

YOU ARE NOT ALLOWED TO CHANGE THE TESTS -> test changed

Or they insert various sleep calls into a test to work around race conditions.
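
(The sleeps are the classic flaky-test move. What I actually want is a bounded poll for the condition the test cares about, something like this plain-TypeScript sketch; the waitFor name and the example condition are made up for illustration.)

    // What the model keeps writing: hope 2 seconds is always enough.
    //   await new Promise((resolve) => setTimeout(resolve, 2000));

    // What I want instead: wait for the actual condition, with a loud timeout.
    async function waitFor(
      condition: () => boolean | Promise<boolean>,
      timeoutMs = 5000,
      intervalMs = 50,
    ): Promise<void> {
      const deadline = Date.now() + timeoutMs;
      while (Date.now() < deadline) {
        if (await condition()) return;
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
      }
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }

    // e.g. await waitFor(() => queue.isEmpty());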

This is all from Claude Sonnet 4.

carb 6 days ago | parent | next [-]

I've found I get better results when I treat LLMs the way you'd treat little kids. Don't tell them what NOT to do, tell them what TO do.

Say "keep your hands at your side, it's hot" and not "don't touch the stove, it's hot". If you say the latter, most kids touch the stove.

alpaca128 6 days ago | parent | next [-]

If LLMs cannot reliably deal with this, how can they write reliable code? Following an instruction like "don't do X" is more basic than the logic of fizzbuzz.

This reminds me of the query "shirt without stripes" on any online image/product search.

zahlman 5 days ago | parent [-]

Obligatory reminder that we used to live in a world where you could put "foo -bar" into a search engine, ctrl-F for foo on the top ten results and find it every time, and ctrl-F for bar on the top ten results and not find it.

alpaca128 3 days ago | parent [-]

Yeah, I've even had cases where DDG ignored my quoted string in the search. Exact matching is literally the whole point of the quotes, but especially when the string contains things like German umlauts it'll just accept any replacement letter for them. And yes, getting no results is acceptable; in fact it's the only correct outcome.

amai 4 days ago | parent | prev | next [-]

Negation is a hard problem for AI and remains largely unsolved:

- https://seantrott.substack.com/p/llms-and-the-not-problem

- https://github.com/elsamuko/Shirt-without-Stripes

glitchcrab 6 days ago | parent | prev [-]

My eureka moment when I first started using Cursor a few weeks back was realising that if I talked to it the same way I talk to my three year old, the results were fairly good (less so from my boy at times).

IshKebab 6 days ago | parent [-]

Yeah, it's also kind of funny watching people discover all the LLM failure modes and say "see! humans would never do that! it's not really intelligent!". None of those people have children...

Chinjut 6 days ago | parent | next [-]

I don't want a computer that's as unreliable as a child. This is not what originally interested me about computers.

IshKebab 5 days ago | parent | next [-]

Nobody said you did. I'm talking about the confidently incorrect assertions that humans would never display any of these unreliable behaviours.

tripzilch 3 days ago | parent [-]

They don't. At least not for the duration that LLMs keep it up. They really don't.

If you want to pretend that being a 3 year old is not a transient state, and that controlling an AI is just like parenting an eternal 3 year old, there's probably a manga about that.

tripzilch 3 days ago | parent | prev [-]

Maybe because none of those people are imagining children to be eternally stuck at that level of intelligence. At that age (regardless of being a parent or not) you can literally see them getting smarter over the course of weeks or months.

sothatsit 6 days ago | parent | prev | next [-]

I have also had this happen, but only when my context gets too long, at which point models stop reading my instructions. It can also happen when there have been too many back-and-forths.

There is a steady decline in models' capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can become a pain to rebuild the context from scratch over and over. Unfortunately, I don't know any other way to avoid the models getting really dumb over time.

maelito 6 days ago | parent | prev | next [-]

LLMs erasing your important comments is so irritating! It's happened to me often.

toenail 6 days ago | parent | prev | next [-]

I simply had Claude write me a linting tool that catches its repeated bad stuff.
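
(Roughly this kind of thing; it's a sketch of the idea rather than the actual tool, assumes the tests live under tests/, and the banned patterns are just the two examples from upthread.)

    import { readFileSync, readdirSync } from 'node:fs';
    import { join } from 'node:path';

    // Patterns the model keeps reintroducing; fail CI if any of them show up.
    const banned: Array<[RegExp, string]> = [
      [/console\.(log|warn|error)\s*=/, 'do not override console methods'],
      [/setTimeout\([^)]*\b\d{3,}\b/, 'no sleep-style timeouts in tests'],
    ];

    // Recursively list every file under a directory.
    const walk = (dir: string): string[] =>
      readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
        entry.isDirectory() ? walk(join(dir, entry.name)) : [join(dir, entry.name)],
      );

    let failed = false;
    for (const file of walk('tests')) {
      const source = readFileSync(file, 'utf8');
      for (const [pattern, message] of banned) {
        if (pattern.test(source)) {
          console.error(`${file}: ${message}`);
          failed = true;
        }
      }
    }
    process.exit(failed ? 1 : 0);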

TheRealDunkirk 6 days ago | parent [-]

I was converting all the views in my Rails app from HAML to ERB. It was doing each one perfectly, so I told it to do the rest. It went through a few, then asked me if it could write a program, and run that. I thought, hey, cool, sure. I get it; it was trying to save tokens. Clever! However -- you know where this is going -- despite knowing all the rules, and demonstrating it could apply them, the program it wrote made a total dog's breakfast out of the rest of the files. Thankfully, I've learned to commit my working copy before big "AI" changes, and I just revert when it barfs. I forced Claude to do the rest "manually" at great token expense, but it did it correctly. I've asked it to write other scripts, which it has also mangled. So I haven't been impressed at Claude's "tool writing" capability yet, and I'm jealous of people who seem to have good luck.

polynomial 5 days ago | parent [-]

Imagine if you had to do this with an actual team member.

paulcole 6 days ago | parent | prev | next [-]

> I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.

Must be fun.

iamflimflam1 6 days ago | parent | prev | next [-]

Do you also share examples of when it works really well?
