hodgehog11 | 3 hours ago
> "...it would sometimes regurgitate training data verbatim. That’s been patched in the years since..." > "They are robots. Programs. Fancy robots and big complicated programs, to be sure — but computer programs, nonetheless." This is totally misleading to anyone with less familiarity with how LLMs work. They are only programs in as much as they perform inference from a fixed, stored, statistical model. It turns out that treating them theoretically in the same way as other computer programs gives a poor representation of their behaviour. This distinction is important, because no, "regurgitating data" is not something that was "patched out", like a bug in a computer program. The internal representations became more differentially private as newer (subtly different) training techniques were discovered. There is an objective metric by which one can measure this "plagiarism" in the theory, and it isn't nearly as simple as "copying" vs "not copying". It's also still an ongoing issue and an active area of research, see [1] for example. It is impossible for the models to never "plagiarize" in the sense we think of while remaining useful. But humans repeat things verbatim too in little snippets, all the time. So there is some threshold where no-one seems to care anymore; think of it like the % threshold in something like Turnitin. That's the point that researchers would like to target. Of course, this is separate from all of the ethical issues around training on data collected without explicit consent, and I would argue that's where the real issues lie. | ||
oasisbob | 2 hours ago
The plagiarism by the models is only part of it. Perhaps it's in such small pieces that it becomes difficult to care. I'm not convinced.

The larger, and I'd argue more problematic, plagiarism is when people take this composite output of LLMs and pass it off as their own.
DoctorOetker | 2 hours ago
To a large extent, both "hallucinations" and "plagiarism" can be addressed with the same training method: source-aware training. https://arxiv.org/abs/2404.01019

At the frontier of science we have speculations which, until proper measurements become possible, are not known to be true or false (or even known to be equivalent to other speculations, regardless of how true or false they turn out to be). Once a question is settled, we may call the earlier but wrong speculations "reasonable wrong guesses". In science it is important that these guesses or suspicions are communicated, because they drive the design of future experiments.

I would argue that more important than "eliminating hallucinations" is tracing the reason a claim is or was believed by some. With source-aware training we can ask an LLM to give answers to a question (answers which may contradict each other), but to provide the training source(s) justifying the emission of each answer. Instead of bluffing, it could emit multiple interpretations and go like:

> Answer A: according to school of thought A the answer is that ... examples of authors and places in my training set are: author+title a1, a2, a3, ...

> Answer B: according to author B, the answer to this question is ... which can be seen in articles b1, b2.

> Answer ...: ...

> Answer F: although I can't find a single document explaining this, when I collate observation x in x1, x2, x3, observation y in y1, y2, ..., and observation z in z1, z2, ..., I conclude the following: ...

That way it is clear which statements are sourced where, and which deductions are the LLM's own.

Obviously few to none of the high-profile LLM providers will do this any time soon, because once jurisdictions learn it is possible they will demand that all models be trained source-aware, so that they can remunerate the authors in their jurisdiction (and levy taxes on that income). What fraction of the income will then go to authors, and what fraction to the LLM providers? If any jurisdiction were to be the first to enforce this, it would probably be the EU, but they don't do it yet.

And if models are trained in a different jurisdiction than the one levying the taxes, the academic in-group citation game will be extended to LLMs: a US LLM will have an incentive to cite only US sources when multiple are available, an EU-trained LLM will prefer to selectively cite European sources, and so on.
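To make "source-aware training" a little more concrete, here is a minimal sketch of the general idea as I understand it from the linked paper, not the authors' actual pipeline: every pretraining document carries a stable source identifier, that identifier is injected into the training sequence itself, and the model is then fine-tuned to emit the identifiers alongside whatever it asserts from those documents. The tags, documents, and example answers below are all placeholders.

```python
# Minimal sketch of source-aware training data preparation.
# Placeholder data and tag format; not the pipeline from arXiv:2404.01019.

# 1. Every training document carries a stable source identifier.
documents = {
    "doc_a1": "Author A (2019): the phenomenon is best explained by mechanism X.",
    "doc_b1": "Author B (2021): measurements suggest mechanism Y instead.",
}

# 2. Inject the identifier into the training sequence itself, so the model
#    associates content with its source during (continued) pretraining.
def tag_document(doc_id, text):
    return f"<source id={doc_id}> {text} </source>"

pretraining_corpus = [tag_document(doc_id, text) for doc_id, text in documents.items()]

# 3. Fine-tuning examples then teach the model to cite those identifiers when
#    answering, so contradictory answers can coexist, each with its attribution.
finetune_example = {
    "prompt": "What explains the phenomenon?",
    "target": (
        "Answer A: mechanism X, according to school of thought A "
        "[sources: doc_a1].\n"
        "Answer B: mechanism Y, according to author B [sources: doc_b1]."
    ),
}

for sequence in pretraining_corpus:
    print(sequence)
print(finetune_example["target"])
```

The design point is that attribution comes from identifiers the model learned during training, not from a post-hoc retrieval step bolted on at inference time, which is what would let it say where a claim came from even when it is not searching anything.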