Remix.run Logo
the_snooze 3 hours ago

Two things can be true at the same time: The technology has improved, and the technology in its current state still isn't fit for purpose.

I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.

The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."

hedgehog 3 hours ago | parent | next [-]

The errors are also not distributed in the same way as you'd expect from a human. The tools can synthesize a whole feature in a moderately complicated web app including UI code, schema changes, etc, and it comes out perfectly. Then I ask for something simple like a shopping list of windshield wipers etc for the cars and that comes out wildly wrong (like wrong number of wipers for the cars, not just the wrong parts), stuff that a ten year old child would have no trouble with. I work in the field so I have a qualitative understanding of this behavior but I think it can be extremely confusing to many people.

jerf 3 hours ago | parent | prev | next [-]

One of the reasons I'm comfortable using them as coding agents is that I can and do review every line of code they generate, and those lines of code form a gate. No LLM-bullshit can get through that gate, except in the form of lines of code, that I can examine, and even if I do let some bullshit through accidentally, the bullshit is stateless and can be extracted later if necessary just like any other line of code. Or, to put it another way, the context window doesn't come with the code, forming this huge blob of context to be carried along... the code is just the code.

That exposes me to when the models are objectively wrong and helps keep me grounded with their utility in spaces I can check them less well. One of the most important things you can put in your prompt is a request for sources, followed by you actually checking them out.

And one of the things the coding agents teach me is that you need to keep the AIs on a tight leash. What is their equivalent in other domains of them "fixing" the test to pass instead of fixing the code to pass the test? In the programming space I can run "git diff *_test.go" to ensure they didn't hack the tests when I didn't expect it. It keeps me wondering what the equivalent of that is in my non-programming questions. I have unit testing suites to verify my LLM output against. What's the equivalent in other domains? Probably some other isolated domains here and there do have some equivalents. But in general there isn't one. Things like "completely forged graphs" are completely expected but it's hard to catch this when you lack the tools or the understanding to chase down "where did this graph actually come from?".

The success with programming can't be translated naively into domains that lack the tooling programmers built up over the years, and based on how many times the AIs bang into the guardrails the tools provide I would definitely suggest large amounts of skepticism in those domains that lack those guardrails.

bensyverson 3 hours ago | parent | prev | next [-]

> the technology in its current state still isn't fit for purpose.

This is a broad statement that assumes we agree on the purpose.

For my purpose, which is software development, the technology has reached a level that is entirely adequate.

Meanwhile, sports trivia represents a stress test of the model's memorized world knowledge. It could work really well if you give the model a tool to look up factual information in a structured database. But this is exactly what I meant above; using the technology in a suboptimal way is a human problem, not a model problem.

the_snooze 2 hours ago | parent [-]

There's nothing in these models that say its purpose is software development. Their design and affordances scream out "use me for anything." The marketing certainly matches that, so do the UIs, so do the behaviors. So I take them at their word, and I see that failure modes are shockingly common even under regular use. I'm not out to break these things at all. I'm being as charitable and empirical as I can reasonably be.

If the purpose is indeed software development with review, then there's nothing stopping multi-billion dollar companies from putting friction into these sytems to direct users towards where the system is at its strongest.

nradov 2 hours ago | parent [-]

The LLM vendors are selling tokens. Why would they put friction into selling more tokens? Caveat emptor.

nradov 2 hours ago | parent | prev | next [-]

Which things actually matter? I think we can all agree that an LLM isn't fit for purpose to control a nuclear power plant or fly a commercial airliner. But there's a huge spectrum of things below that. If an LLM trading error causes some hedge fund to fail then so what? It's only money.

abraxas 2 hours ago | parent [-]

Not to mention that it would then make some hedge fund with a better backtesting harness or more AI scrutiny more successful thus keeping the financial market work as designed.

simianwords 2 hours ago | parent | prev | next [-]

> I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.

95% is not my experience and frankly dishonest.

I have ChatGPT open right now, can you give me examples where it doesn't work but some other source may have got it correct?

I have tested it against a lot of examples - it barely gets anything wrong with a text prompt that fits a few pages.

> The most intellectually honest way to evaluate these things is how they behave now on real tasks

A falsifiable way is to see how it is used in real life. There are loads of serious enterprise projects that are mostly done by LLMs. Almost all companies use AI. Either they are irresponsible or you are exaggerating.

Lets be actually intellectually honest here.

qsera 2 hours ago | parent [-]

>95% is not my experience and frankly dishonest.

Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one has a lot of redundancy and the other one has not).

simianwords 2 hours ago | parent [-]

I'm asking for a single example.

qsera 2 hours ago | parent [-]

But why do you need an example? Isn't it pretty well understood that LLMS will have trouble responding to stuff that is under represented in the training data?

You will just won't have any clue what that could be.

simianwords 2 hours ago | parent [-]

fair so it must be easy to give an example? I have ChatGPT open with 5.4-thinking. I'm honestly curious about what you can suggest since I have not been able to get it to bullshit easily.

qsera 2 hours ago | parent [-]

I am not the OP, an I have only used ChatGPT free version. Last day I asked it something. It answered. Then I asked it to provide sources. Then it provided sources, and also changed its original answer. When I checked the new answers it was wrong, and when I checked sources, it didn't actually contain the information that I asked for, and thus it hallucinated the answers as well as the sources...

simianwords an hour ago | parent [-]

I trust you. If it were happening so frequently you may be able to give me a single prompt to get it to bullshit?

floren 2 hours ago | parent | prev [-]

Six months bro, we're still so early