zozbot234 7 months ago

They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
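
A minimal sketch of what (2) could look like at the data-pipeline level; the record shape and field names are hypothetical, not any lab's actual schema:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TrainingRecord:
        """One corpus document with its provenance kept alongside the text."""
        text: str
        source_url: str           # where the content was obtained
        license: str              # e.g. "public-domain", "CC-BY-4.0", "all-rights-reserved"
        author: Optional[str] = None

    def needs_attribution(record: TrainingRecord) -> bool:
        # Public-domain text (bucket 1) needs no attribution; everything else
        # keeps enough metadata for the serving layer to surface a source.
        return record.license != "public-domain"

    corpus = [
        TrainingRecord("Call me Ishmael. Some years ago...",
                       "https://www.gutenberg.org/ebooks/2701",
                       "public-domain", "Herman Melville"),
    ]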

CaptainFever 7 months ago | parent [-]

The latter is indeed possible with RAG solutions like ChatGPT Search, which already provide sources! :)
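
A minimal sketch of why RAG gets sources almost for free: the answer is grounded in retrieved documents, each with a known URL. Here `retriever` and `llm` are hypothetical interfaces, not ChatGPT Search's actual API:

    def answer_with_sources(question, retriever, llm):
        # Each retrieved doc is assumed to carry its own URL,
        # so attribution falls out of the pipeline naturally.
        docs = retriever.search(question, top_k=3)   # each doc: {"text": ..., "url": ...}
        context = "\n\n".join(d["text"] for d in docs)
        prompt = ("Answer using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return {"answer": llm.complete(prompt),
                "sources": [d["url"] for d in docs]}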

But for inference in general, I'm not sure it makes much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed, IMO. (Attribution: Humanity)

But who knows. Maybe it can be done for more fact-like stuff.

noitpmeder 7 months ago | parent | next [-]

On this point, I'm sure there is more than enough publicly and freely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.

CaptainFever 7 months ago | parent [-]

I would actually love it if that were true; it would reduce a lot of legal headaches, for sure. But if it were true, why were previous GPT versions not as good at understanding language? I can only conclude that it's not actually true: there isn't enough digital public-domain material to train an LLM to understand language competently.

Perhaps old texts in physical form, then? It would cost a lot to digitize those, wouldn't it? And they wouldn't really be accessible to AI hobbyists, unless the digitization were publicly funded or something.

(A big part of this is also how insanely long copyright lasts (nearly a hundred years!), which keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)

Edit:

Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."

And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which the judge's decision in TFA supported), but then that's a different debate about whether people should be able to own facts at all. Which does have some inroads (e.g. a proposed right against being summarized, since summaries remove the reason to read the original news articles).

TeMPOraL 7 months ago | parent | prev [-]

> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time.

Attribution at the inference level is bound to work more or less the same way humans attribute things in conversation: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ...or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.

RAG can obviously work for this, as can other approaches that retrieve, find, or confirm sources. That's just like a human actually looking up the source when citing something - with similar caveats and costs.
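
The "actually looks up the source" step can be mechanical, too. A hypothetical sketch of confirming that a claimed quote really appears (approximately) in a retrieved source text:

    import difflib

    def confirm_quote(quote, source_text, threshold=0.8):
        # Slide a quote-sized window across the source and keep the best
        # fuzzy-match ratio; misremembered citations score low.
        n = len(quote)
        best = 0.0
        step = max(1, n // 2)
        for i in range(0, max(1, len(source_text) - n + 1), step):
            window = source_text[i:i + n]
            best = max(best, difflib.SequenceMatcher(None, quote, window).ratio())
        return best >= threshold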

CaptainFever 7 months ago | parent [-]

That sounds about right. When I ask ChatGPT about "ought implies can", for example, it cites Kant.