zozbot234 7 hours ago
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about whatever copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good-faith intent.
CaptainFever 7 hours ago | parent
The latter is possible with RAG solutions like ChatGPT Search, which do already provide sources! :) But for inference in general, I'm not sure it makes much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. That's kind of too fundamental to be attributed to anyone, IMO. (Attribution: Humanity) But who knows. Maybe it can be done for more fact-like stuff.
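To make the RAG point concrete: because retrieval happens before generation, each retrieved chunk can carry its source along with it, and the final answer can cite those sources verbatim. Here's a minimal sketch of that idea; the documents, URLs, and keyword-overlap scoring are all hypothetical stand-ins (a real system would use embedding search and an LLM):

```python
# Toy "index": each chunk keeps the source it came from.
chunks = [
    {"text": "The Berne Convention governs copyright internationally.",
     "source": "example.org/berne"},
    {"text": "RAG systems retrieve documents before generating answers.",
     "source": "example.org/rag"},
]

def retrieve(query, k=1):
    # Naive keyword-overlap scoring stands in for a real embedding search.
    def score(chunk):
        return len(set(query.lower().split()) & set(chunk["text"].lower().split()))
    return sorted(chunks, key=score, reverse=True)[:k]

def answer_with_sources(query):
    hits = retrieve(query)
    # A real system would pass `hits` to an LLM as context; here we just
    # echo them, keeping the attribution attached to the answer.
    body = " ".join(h["text"] for h in hits)
    cites = ", ".join(h["source"] for h in hits)
    return f"{body} [sources: {cites}]"

print(answer_with_sources("How do RAG systems retrieve documents?"))
```

The attribution here is trivial because it never leaves the data path, which is exactly why it's easier than attributing what a model absorbed during training.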