| ▲ | sothatsit 4 days ago |
| I tend to think that the reason people over-index on complex use-cases for LLMs is actually reliability, not a lack of interest in boring projects. If an LLM can solve a complex problem 50% of the time, then that is still very valuable. But if you are writing a system of small LLMs doing small tasks, then even 1% error rates can compound into highly unreliable systems when stacked together. The cost of LLMs occasionally giving you wrong answers is worth it for answers to harder tasks, in a way that it is not worth it for smaller tasks. For those smaller tasks, usually you can get much closer to 100% reliability, and more importantly much greater predictability, with hand-engineered code. This makes it much harder to find areas where small LLMs can add value for small boring tasks. Better auto-complete is the only real-world example I can think of. |
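A back-of-the-envelope sketch of the compounding point above (numbers purely illustrative): a chain of steps that are each 99% reliable degrades quickly as the chain grows.

```python
# Back-of-the-envelope: probability that a chain of independent steps all
# succeed, if each step is 99% reliable (illustrative numbers only).
def pipeline_reliability(per_step: float, steps: int) -> float:
    return per_step ** steps

for steps in (5, 10, 20, 50):
    print(f"{steps:>2} steps -> {pipeline_reliability(0.99, steps):.3f}")
# 5 steps -> 0.951, 10 -> 0.904, 20 -> 0.818, 50 -> 0.605
```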
|
| ▲ | a_bonobo 4 days ago | parent | next [-] |
| > If an LLM can solve a complex problem 50% of the time I'd adjust that statement: if an LLM can solve a complex problem 50% of the time and I can evaluate the correctness of the output, then that is still very valuable. I've seen too many people blindly pass on LLM output - for a short while it was a trend in the scientific literature to have LLMs evaluate output of other LLMs? Who knows how correct that was. Luckily that has ended. |
| |
| ▲ | danpalmer 4 days ago | parent | next [-] | | > I've seen too many people blindly pass on LLM output I misread this the first time and realised both interpretations are happening. I've seen people copy-paste out of ChatGPT without reading, and I've seen people "pass on" or reject content simply because it has been AI generated. | |
| ▲ | adastra22 4 days ago | parent | prev | next [-] | | > for a short while it was a trend in the scientific literature to have LLMs evaluate output of other LLMs? Who knows how correct that was. Highly reliable. So much so that this is basically how modern LLMs work internally. Also, speaking from personal experience in the projects I work on, it is the chief way to counteract hallucination and poisoned context windows, and to scale beyond the interaction limit. LLMs evaluating LLM output works surprisingly well. | |
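A hedged sketch of that generate-then-critique pattern; `call_llm`, the prompts, and `generate_with_review` are hypothetical stand-ins, not any particular library's API.

```python
# Hypothetical generate-then-critique loop. `call_llm` is a stand-in for
# whatever chat-completion client you use; it is not a real library call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def generate_with_review(task: str, max_attempts: int = 3) -> str:
    draft = call_llm(f"Solve the following task:\n{task}")
    for _ in range(max_attempts):
        verdict = call_llm(
            "You are a strict reviewer. Reply PASS if the answer is correct "
            f"and well grounded; otherwise list the problems.\n\n"
            f"Task: {task}\n\nAnswer: {draft}"
        )
        if verdict.strip().startswith("PASS"):
            return draft
        draft = call_llm(
            f"Revise the answer to fix these problems:\n{verdict}\n\n"
            f"Task: {task}\n\nPrevious answer: {draft}"
        )
    return draft  # best effort after max_attempts rounds of review
```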
| ▲ | sothatsit 4 days ago | parent | prev | next [-] | | True! This is what has me more excited about LLMs producing Lean proofs than written maths proofs. Lean proofs can be checked mechanically for correctness, whereas maths proofs require experts to read them and look for mistakes. More broadly, I do think there are lots of problems where verification is easier than doing the task itself, especially in computer science. In fact, it is probably easier to list the tasks that aren't easier to verify than to do from scratch. Security is one major one. | | |
| ▲ | hansvm 4 days ago | parent [-] | | Even there it's risky. LLMs are good at subtly misstating the problem, so it's relatively easy to make them prove things which look like the thing you wanted but which are mostly unrelated. | | |
| ▲ | sothatsit 4 days ago | parent [-] | | Yes, Lean only lets you be confident in the contents of the proof, not in how it was formed. But I still think that's pretty cool and valuable. |
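A minimal Lean 4 illustration of the point in this subthread (the theorem names and statements are made up for the example): the kernel certifies that a proof matches its stated theorem, but it cannot tell you whether that statement is the one you actually meant.

```lean
-- The kernel happily certifies this: the proof matches the statement.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- It just as happily certifies a subtly weaker statement. If a model had
-- quietly specialised the claim to `b = 0`, the file still checks, but it
-- no longer says what you wanted.
theorem sum_comm_weak (a : Nat) : a + 0 = 0 + a :=
  Nat.add_comm a 0
```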
|
| |
| ▲ | empiko 4 days ago | parent | prev [-] | | > Who knows how correct that was. Luckily that has ended. What do you mean it ended? I still see tons of NLP papers with this methodology. |
|
|
| ▲ | raincole 4 days ago | parent | prev [-] |
| Yeah. Is it even proven that LLMs don't hallucinate for smaller tasks? The author seems to imply that. I fail to see how it could be true. |
| |
| ▲ | adastra22 4 days ago | parent [-] | | No? That is trivially not the case. Ask an LLM something outside its training data and it will hallucinate the answer. How can it do anything else? Maybe its hallucination ends up being correct, but not all of the time. |
|