jordanpg a day ago

Does anyone know, from a technical standpoint, why citations are such a problem for LLMs?

I realize things are probably (much) more complicated than I appreciate, but programmatically, unlike arbitrary text, citations are generally strings with a well-defined format. There are literally "specs" for citation formats in various academic, legal, and scientific fields.

So, naively, one way to mitigate these hallucinations would be to identify citations with a bunch of regexes, and if one is spotted, use the Google Scholar API (or whatever) to make sure it's real. If not, delete it or flag it, etc.

Why isn't something like this obvious solution being done? My guess is that it would slow things down too much. But it could be optional and it could also be done after the output is generated by another process.
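A minimal sketch of the post-hoc check described above, assuming DOI-style identifiers as the citation format (real citation formats vary far more widely, and the lookup step is stubbed out since there is no official Google Scholar API; a real implementation might query something like Crossref instead):

```python
import re

# Simplified regex for DOI-style citation strings (real-world DOIs and
# other citation formats are messier than this pattern allows).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b")

def extract_dois(text):
    """Return DOI-like strings found in a block of LLM output."""
    return DOI_PATTERN.findall(text)

def verify_citation(doi):
    """Hypothetical verification step. A real implementation would issue
    an HTTP request to a bibliographic database and check whether the
    identifier resolves to a known record, flagging it otherwise."""
    raise NotImplementedError

# Example: scan generated text for candidate citations to verify.
output = "See Smith et al. (2021), doi 10.1000/xyz123, for details."
print(extract_dois(output))
```

This only covers machine-readable identifiers; free-text citations (author, year, title) would need fuzzy matching against a bibliographic database rather than a regex, which is where the approach gets harder.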

Muller20 a day ago | parent [-]

In general, a citation is something that needs to be precise, while LLMs are very good at generating generic, high-probability text that isn't grounded in reality. Sure, you could implement a custom fix for the very specific problem of citations, but you cannot solve all kinds of hallucinations that way. After all, if you could develop a manual solution for everything, you wouldn't need an LLM.

There are some mitigations that are used such as RAG or tool usage (e.g. a browser), but they don't completely fix the underlying issue.

jordanpg a day ago | parent [-]

My point is that citations are constantly making headlines, yet, at least at first glance, this seems like an eminently solvable problem.

ml-anon a day ago | parent [-]

So solve it?