MelonUsk 3 hours ago

Yep, that’s true

But those links are Googled after the model has started to answer; they are not links to the training data

Imagine an artificial “librarian” that has read all the books and spits out hallucinated quotes for you

But doesn’t let you enter the library, open a single book or even see the sources for those hallucinated quotes

But instead Googles some sources based on hallucinations after generating them ;-)

It’s better than nothing, but you can Google them too, while the training data (the library) is completely hidden from you, even the public-domain parts of it: zero attribution

dmortin 2 hours ago

There should be at least some correlation. When building the model they give more weight to some pages (e.g. Wikipedia) that are more trusted (PageRank?). And when they provide links in answers, the matches with better PageRank for the query are listed first.

So if it learned something from Wikipedia, it is more likely to cite Wikipedia as a trusted source for it.

The problem is that when an answer is hallucinated or false, it may cite a source that itself contains the invalid info.

MelonUsk 2 hours ago

Yep, a few non-profits work on direct training data attribution:

OLMoTrace, Guide Labs with Clarity and a few more

Labs train the model with attribution baked in, and they say the bigger the model, the more interpretable it becomes

Pretty sure it’s the future