ben_w · 4 hours ago
We saw partial copies of large or rare documents, and full copies of smaller, widely reproduced documents, not full copies of everything. A 1-trillion-parameter model, for example, is not a lossless copy of a ten-petabyte slice of plain text from the internet. The distinction might not have mattered for copyright law if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"
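A rough back-of-envelope makes the point concrete. This is only a sketch: the 1T-parameter and 10 PB figures are the ones above, while the 2 bytes per weight (bf16) and the typical lossless text-compression ratios are my own assumptions.

    # Back-of-envelope: could a 1T-parameter model losslessly store 10 PB of text?
    params = 1e12
    bytes_per_param = 2                     # assume bf16 weights
    model_bytes = params * bytes_per_param  # ~2 TB of weights

    corpus_bytes = 10e15                    # 10 PB of plain text

    ratio = corpus_bytes / model_bytes
    print(f"weights: ~{model_bytes / 1e12:.0f} TB, corpus: ~{corpus_bytes / 1e15:.0f} PB")
    print(f"lossless compression ratio required: ~{ratio:.0f}:1")
    # ~5000:1, while strong general-purpose text compressors manage roughly 3-10:1,
    # so a verbatim copy of the whole corpus simply cannot fit in the weights.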

tza54j · 3 hours ago
We are in a clean-room implementation thread here, and verbatim copies of entire works are irrelevant to that topic. It is enough to have read even parts of a work for something to be considered a derivative. I would also argue that language models, which need gargantuan amounts of training material in order to work, by definition can only output derivative works. It does not help that certain people in this thread (not you) edit their comments to backpedal and make the follow-up comments look illogical, but that is in line with their sleazy post-LLM behavior.
| ||||||||
| ▲ | antirez 3 hours ago | parent | prev | next [-] | |||||||
Besides, the fact that an LLM may recall parts of certain documents, just as I can recall the incipits of certain novels, does not mean that when you ask the LLM to do some other kind of work, one that is not about recalling stuff, it will mix such things in verbatim. The LLM knows what it is doing in a variety of contexts and uses that knowledge to produce things. The fact that many people find it bitter that LLMs can do things that replace humans does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has zero explanation in memorization of verbatim material. So it's not a matter of copyright. Certain folks are fighting the wrong battle.
| ||||||||
| ▲ | philipportner 2 hours ago | parent | prev | next [-] | |||||||
Granted, these are some of the most widely spread texts, but just FYI: https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
| ||||||||
| ▲ | boroboro4 3 hours ago | parent | prev [-] | |||||||
While I mostly agree with you, it's worth noting that modern LLMs are trained on 10-30T tokens, which is quite comparable to their size (especially given how compressible the data is).
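To put illustrative numbers on that: everything below is an assumption except the token range from this comment (~4 bytes per token, a ~4:1 lossless compression ratio for plain text, a 1T-parameter bf16 model).

    # Rough sketch: training-data volume vs. model capacity, illustrative numbers only.
    tokens_trained = 15e12        # middle of the 10-30T range mentioned above
    bytes_per_token = 4           # rough average for English text tokenizers (assumption)
    text_compression = 4          # plain text compresses well losslessly (assumption)

    params = 1e12                 # frontier-scale model (assumption)
    bytes_per_param = 2           # bf16

    raw_tb = tokens_trained * bytes_per_token / 1e12
    compressed_tb = raw_tb / text_compression
    weights_tb = params * bytes_per_param / 1e12

    print(f"raw training text:          ~{raw_tb:.0f} TB")
    print(f"after lossless compression: ~{compressed_tb:.0f} TB")
    print(f"model weights:              ~{weights_tb:.0f} TB")
    # Roughly one order of magnitude apart, not thousands-fold as in the 10 PB framing:
    # still far from a verbatim copy, but close enough that memorizing common text is unsurprising.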