| ▲ | The Problem with LLMs (deobald.ca) |
| 39 points by vinhnx 5 hours ago | 28 comments |
| |
|
| ▲ | bambax 2 hours ago | parent | next [-] |
| > Translators are busy
|
| No, they're not. They're starving, struggling to find work, and lamenting that AI is eating their lunch. It's quite ironic that after complaining that LLMs are plagiarism machines, the author thinks using them for translation is fine. "LLMs are evil! Except when they're useful to me," I guess. |
| |
| ▲ | beering an hour ago | parent [-] | | Simultaneously, if you hire human translators, you are likely to get machine translations. Maybe not often or overtly, but the translation industry has not been healthy for a while. |
|
|
| ▲ | hodgehog11 2 hours ago | parent | prev | next [-] |
| > "...it would sometimes regurgitate training data verbatim. That’s been patched in the years since..."
|
| > "They are robots. Programs. Fancy robots and big complicated programs, to be sure — but computer programs, nonetheless."
|
| This is misleading to anyone less familiar with how LLMs work. They are programs only inasmuch as they perform inference from a fixed, stored statistical model. It turns out that treating them theoretically the same way as other computer programs gives a poor representation of their behaviour.
|
| The distinction matters because, no, "regurgitating data" is not something that was "patched out" like a bug in a computer program. The internal representations became more differentially private as newer (subtly different) training techniques were discovered. There is an objective metric by which one can measure this "plagiarism" in the theory, and it isn't nearly as simple as "copying" vs. "not copying". It's also still an ongoing issue and an active area of research; see [1] for example.
|
| It is impossible for the models never to "plagiarize" in the sense we think of while remaining useful. But humans repeat things verbatim in little snippets all the time, too. So there is some threshold below which no one seems to care anymore; think of it like the % threshold in something like Turnitin. That's the point researchers would like to target.
|
| Of course, this is separate from all of the ethical issues around training on data collected without explicit consent, and I would argue that's where the real issues lie.
|
| [1] https://arxiv.org/abs/2601.02671 |
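The Turnitin-style percentage threshold described in the comment above can be sketched as a verbatim n-gram overlap check. This is a hypothetical illustration, not a metric from the memorization literature; the 5-token window, the whitespace tokenizer, and all names here are made up for the example:

```python
def ngrams(tokens, n):
    """All distinct contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(output: str, corpus: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    out_grams = ngrams(output.split(), n)
    if not out_grams:
        return 0.0
    corpus_grams = ngrams(corpus.split(), n)
    return len(out_grams & corpus_grams) / len(out_grams)

corpus = "the quick brown fox jumps over the lazy dog every single day"
copied = "the quick brown fox jumps over the lazy dog"    # near-verbatim
fresh = "a slow red panda climbs up a tall bamboo stalk"  # original text

print(verbatim_overlap(copied, corpus))  # 1.0: every 5-gram is in the corpus
print(verbatim_overlap(fresh, corpus))   # 0.0: no 5-gram overlap
```

A real study would use the model's tokenizer and an indexed corpus rather than raw string splitting, but the "some overlap is tolerated, too much is flagged" shape of the argument is the same: pick a window size and a threshold, and everything below it counts as ordinary reuse of common phrases.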
| |
| ▲ | DoctorOetker 15 minutes ago | parent | next [-] |
|
| To a large extent, both "hallucinations" and "plagiarism" can be addressed with the same training method: source-aware training. https://arxiv.org/abs/2404.01019
|
| At the frontier of science we have speculations which, until proper measurements become possible, are unknown to be true or false (or even unknown to be equivalent to other speculations, regardless of whether they are true or false). Once a question is settled, we may call the earlier, wrong speculations "reasonable wrong guesses". In science it is important that these guesses or suspicions are communicated, as they drive the design of future experiments.
|
| I argue that more important than "eliminating hallucinations" is tracing the reason something is or was believed by someone. With source-aware training we can ask an LLM to give answers to a question (answers which may contradict each other) but to provide the training source(s) justifying the emission of each answer. Instead of bluffing, it could emit multiple interpretations and go like:
|
| > answer A: according to school of thought A the answer is that ... examples of authors and places in my training set are: author+title a1, a2, a3, ...
|
| > answer B: according to author B the answer to this question is ... which can be seen in articles b1, b2
|
| > answer ...: ...
|
| > answer F: although I can't find a single document explaining this, when I collate observation x in x1, x2, x3; observation y in y1, y2, ...; observation z in z1, z2, ...; then I conclude the following: ...
|
| This way it is clear which statements are sourced where, and which deductions are the LLM's own.
|
| Obviously, few to none of the high-profile LLM providers will do this any time soon, because once jurisdictions learn this is possible, they will demand that all models be trained source-aware, so that they can remunerate the authors in their jurisdiction (and levy taxes on that income). What fraction of the income will then go to authors, and what fraction to the LLM providers? If any jurisdiction were to be first to enforce this, it would probably be the EU, but it doesn't do it yet. And if models are trained in a different jurisdiction than the one levying taxes, the academic in-group citation game will extend to LLMs: a US LLM will have an incentive to cite only US sources when multiple are available, an EU-trained LLM will prefer to selectively cite European sources, and so on. | |
| ▲ | oasisbob an hour ago | parent | prev [-] | | The plagiarism by the models is only part of it. Perhaps it's in such small pieces that it becomes difficult to care. I'm not convinced. The larger, and I'd argue more problematic, plagiarism is when people take this composite output of LLMs and pass it off as their own. |
|
|
| ▲ | woeirua 2 hours ago | parent | prev | next [-] |
| > As a quick aside, I am not going to entertain the notion that LLMs are intelligent, for any value of “intelligent.” They are robots. Programs. Fancy robots and big complicated programs, to be sure — but computer programs, nonetheless. The rest of this essay will treat them as such. If you are already of the belief that the human mind can be reduced to token regurgitation, you can stop reading here. I’m not interested in philosophical thought experiments.
|
| I can't imagine why someone would want to openly advertise that they're so closed-minded. Everything after this paragraph is just anti-LLM ranting. |
| |
| ▲ | hodgehog11 an hour ago | parent | next [-] | | I disagree that the majority of it is anti-LLM ranting; there are several subtle points here that are grounded in realism. You should read on past the (admittedly naive) first few paragraphs if you're judging mainly from those. | | | |
| ▲ | Cloudef an hour ago | parent | prev | next [-] | | What's wrong with the statement? The black-box algorithm might have been generated by machine learning, but it's still a computer program in the end. | |
| ▲ | palmotea 22 minutes ago | parent | prev | next [-] | | > I can't imagine why someone would want to openly advertise that they're so closed minded. It's not being closed-minded. It's not wanting to get sea-lioned to death by obnoxious people. | |
| ▲ | wolrah an hour ago | parent | prev | next [-] |
|
| > I can't imagine why someone would want to openly advertise that they're so closed minded.
|
| I would say the exact same about you. Rejecting an absolutely accurate and factual statement like that as closed-minded strikes me as the same as the people who insist that medical science is closed-minded about crystals and magnets. I can't imagine why someone would want to openly advertise that they think LLMs are actual intelligence, unless they were in a position to benefit financially from the LLM hype train, of course. | |
| ▲ | woeirua 23 minutes ago | parent [-] | | Cool, so clearly articulate the goalposts: what do LLMs have to do to convince you that they are intelligent? If the answer is that no amount of evidence can change your mind, then you're not arguing in good faith. |
| |
| ▲ | acjohnson55 2 hours ago | parent | prev | next [-] | | It was actually much less anti-LLM than I was expecting from the beginning. But I agree that it is self-limiting not to bother considering the ways that LLM inference and human thinking might be similar (or not). To me, they seem to do a pretty reasonable emulation of single-threaded thinking. | |
| ▲ | Ygg2 an hour ago | parent | prev [-] |
|
| > I can't imagine why someone would want to openly advertise that they're so closed minded.
|
| Because humans often anthropomorphize completely inert things? E.g. a coffee machine or a bomb-disposal robot. So far, whatever behavior LLMs have shown is basically fueled by sci-fi stories of how a robot should behave under such and such circumstances. |
|
|
| ▲ | bronlund 2 hours ago | parent | prev | next [-] |
| Give it up. Buddha would not approve. And there will be more compute for the rest of us :) |
|
| ▲ | CuriouslyC 2 hours ago | parent | prev | next [-] |
| Can we as a group agree to stop upvoting "AI is great" and "AI sucks" posts that don't make novel, meaningful arguments that provoke real thought? The plagiarism argument is thin and feels biased, the lock-in argument is counter to the market dynamics that are currently playing out, and in general the takes are just one dude's vibes. |
| |
| ▲ | pixelmelt 2 hours ago | parent | next [-] | | I dunno, I enjoyed reading about how the author personally feels about the act of working with them more than the whole "is this moral" part. | |
| ▲ | gwern 2 hours ago | parent | prev | next [-] | | I don't know, this one is a little novel. I've never seen the developer of a Buddhist meditation app discuss whether to use LLMs with a paragraph like: > Pariyatti’s nonprofit mission, it should be noted, specifically incorporates a strict code of ethics, or sīla: not to kill, not to steal, not to engage in sexual misconduct, not to lie, and not to take intoxicants. Not a whole lot of Pali in most LLM editorials. | | |
| ▲ | akoboldfrying 2 hours ago | parent [-] | | > not to engage in sexual misconduct I must remember to add this quality guarantee to my own software projects. My software projects are also uranium-free. |
| |
| ▲ | ares623 2 hours ago | parent | prev [-] | | > The plagiarism argument is thin and feels biased are you being serious with this one | | |
| ▲ | CuriouslyC 2 hours ago | parent [-] | | If you're already sold on the plagiarism narrative that big entertainment is trying to propagandize in order to get leverage against the tech companies, nothing I say is going to change your mind. | | |
| ▲ | KittenInABox 2 hours ago | parent [-] | | I don't really know what you mean by "big entertainment" trying to get leverage against tech companies. Tech companies are behemoths. Most of the artists I know fretting about AI don't earn half a junior engineer's salary. And this is coming from someone who is relatively bullish on AI. I just don't think the framing of "big entertainment" makes any sense at all. |
|
|
|
|
| ▲ | anonu 2 hours ago | parent | prev | next [-] |
| I stopped reading after "problem with LLMs is plagiarism"... |
| |
| ▲ | acjohnson55 2 hours ago | parent [-] | | Too bad. You missed some interesting stuff. And I say that as someone who sees some of this very differently than the author. Announcing that one line of the piece made you mad without providing any other thought is not very constructive. |
|
|
| ▲ | bayarearefugee 3 hours ago | parent | prev | next [-] |
| > LLMs will always be plagiarism machines but in 40 years we might not care.
|
| 40 years? Virtually nobody cares about this already... today. (I'm not refuting the author's claim that LLMs are built on plagiarism, just noting how the world has collectively decided to turn a blind eye to it.) |
|
| ▲ | DiogenesKynikos an hour ago | parent | prev [-] |
| > As a quick aside, I am not going to entertain the notion that LLMs are intelligent, for any value of “intelligent.” They are robots. Programs. Fancy robots and big complicated programs, to be sure — but computer programs, nonetheless.
|
| The same could be said of humans, too. Humans are made of cells that work deterministically. Sure, humans are fancy, big, complicated combinations of cells - but they're cells, nonetheless.
|
| That view of humans - and LLMs - ignores the fact that when you combine large numbers of simple building blocks, you can get completely novel behavior. Protons, neutrons and electrons come together to create chemistry. Molecules come together to create biological systems. A bunch of neurons taken together created the poetry of Shakespeare.
|
| Unless you have a dualistic view of the world, in which the mind is a separate realm that exists independently of matter and does not arise from neurons interacting in our brains, you have to accept that robots can be intelligent.
|
| Just to put this more sharply: would a perfect simulation of a human brain be intelligent or not? If you answer "no," then you believe that thought comes from some other, immaterial realm, not from our brains. |
| |
| ▲ | Ygg2 an hour ago | parent [-] |
|
| > That view of humans - and LLMs - ignores the fact that when you combine large numbers of simple building blocks, you can get completely novel behavior.
|
| I can bang smooth rocks together to get sharper rocks; that doesn't make the sharper rocks more intelligent. Makes them sharper, though. Which is to say, novel behavior != intelligence. | |
| ▲ | refactor_master 28 minutes ago | parent [-] | | Yes, that seems to hold for rocks. But that doesn’t shut down the original post’s premise, unless you hold the answer to what can and cannot be banged together to create emergent intelligence. |
|
|