
To a large extent both "hallucinations" and "plagiarism" can be addressed with the same training method: source-aware training.

https://arxiv.org/abs/2404.01019
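As a rough illustration of what this could look like on the data side (the document IDs, tags, and formats below are hypothetical, not the paper's exact scheme), each training document gets a unique source identifier injected into its token stream, and the model is later fine-tuned on pairs that ask it to cite those identifiers alongside its answers:

    # Minimal sketch of source-aware training data construction, loosely following
    # the idea in arXiv:2404.01019. IDs, tags, and prompt formats are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Document:
        doc_id: str   # e.g. an internal corpus key or citation key
        text: str

    def pretraining_example(doc: Document) -> str:
        # Inject the source ID directly into the token stream so the model can
        # associate the document's content with its identifier.
        return f"<src:{doc.doc_id}> {doc.text} </src>"

    def attribution_example(question: str, answer: str, doc: Document) -> dict:
        # Instruction-tuning pair: the target contains both the answer and the
        # identifier of the document that justifies it.
        return {
            "prompt": f"Question: {question}\nAnswer with sources:",
            "target": f"{answer} [source: {doc.doc_id}]",
        }

    docs = [Document("paper:a1", "School of thought A holds that ..."),
            Document("paper:b1", "Author B argues that ...")]

    for d in docs:
        print(pretraining_example(d))
    print(attribution_example("What is the answer?", "According to school A, ...", docs[0]))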

At the frontier of science we have speculations which, until proper measurements become possible, are not known to be true or false (or even known to be equivalent to other speculations, regardless of whether they turn out true or false). Once a question is settled, we may call the earlier, incorrect speculations "reasonable wrong guesses". In science it is important that these guesses and suspicions are communicated, because they drive the design of future experiments.

I argue that more important than "eliminating hallucinations" is tracing why a statement is, or was, believed by some.

With source-aware training we can ask an LLM to give answers to a question (answers which may contradict each other) and to cite the training source(s) that justify emitting each one. Instead of bluffing, it could present multiple interpretations along the lines of:

> answer A: according to school of thought A the answer is that ... examples of authors and places in my training set are: author+title a1, a2, a3, ...

> answer B: according to author B: the answer to this question is ... which can be seen in articles b1, b2

> answer ...: ...

> answer F: although I can't find a single document explaining this, when I collate the observation x in x1, x2, x3; observation y in y1,y2, ... , observation z in z1, z2, ... then I conclude the following: ...

so that it is clear which statements are sourced where, and which deductions are the LLM's own.
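In practice this suggests a structured output contract along the lines of the hypothetical schema below (not any provider's actual API): every answer either lists training-set source identifiers or is explicitly flagged as the model's own collation, so unsourced claims can be rejected mechanically.

    # Hypothetical schema for the multi-answer, per-source responses sketched above.
    import json
    from dataclasses import dataclass, field

    @dataclass
    class SourcedAnswer:
        label: str                      # "A", "B", ..., "F"
        answer: str
        sources: list[str] = field(default_factory=list)  # training-set identifiers
        own_deduction: bool = False     # True when collated by the model itself

    def validate(answers: list[SourcedAnswer]) -> None:
        # Reject any answer that neither cites a source nor admits to being
        # the model's own deduction.
        for a in answers:
            if not a.sources and not a.own_deduction:
                raise ValueError(f"Answer {a.label} is unsourced and not marked as the model's own deduction")

    raw = json.loads("""[
      {"label": "A", "answer": "According to school of thought A ...", "sources": ["a1", "a2", "a3"]},
      {"label": "F", "answer": "Collating observations x, y, z I conclude ...", "sources": [], "own_deduction": true}
    ]""")

    answers = [SourcedAnswer(**item) for item in raw]
    validate(answers)
    print([a.label for a in answers], "validated")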

Obviously few, if any, of the high-profile LLM providers will do this any time soon, because once jurisdictions learn it is possible they will demand that all models be trained source-aware, so that they can remunerate the authors in their jurisdiction (and levy taxes on that income). What fraction of the income would then go to authors, and what fraction to the LLM providers? If any jurisdiction is going to be the first to enforce this, it will probably be the EU, but it has not done so yet.

If models are trained in a different jurisdiction than the one levying the taxes, the academic in-group citation game will simply be extended to LLMs: a US-trained LLM will have an incentive to cite only US sources when multiple are available, an EU-trained LLM will prefer to cite European sources, and so on.