Dylan16807 4 days ago

The aspect it's supposed to copy is the statistics of how words work.

And in general, when an LLM is able to recreate text verbatim, that's a training error. Recreating text is not the purpose. Which is not to excuse it happening, but the distinction matters.

program_whiz 4 days ago | parent [-]

In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own training. This training nudges the model weights to make the text more likely to be produced by the model -- that is its explicit goal. A perfect model would be able to reproduce the text perfectly (0 loss).
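
A minimal sketch of that objective (illustrative PyTorch; model, tokens, and optimizer are hypothetical stand-ins, not any lab's actual pipeline):

    import torch
    import torch.nn.functional as F

    def training_step(model, tokens, optimizer):
        # tokens: LongTensor (batch, seq_len) drawn from the training corpus
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
        # The loss reaches 0 only if the model assigns probability 1 to
        # every true next token -- i.e. can reproduce the text exactly.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

Gradient descent lowers this loss across the whole corpus, so 0 loss would mean perfect reproduction of every training text.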

Real-world absurd example: A company hires a bunch of workers. It gives them access to millions of books and has them read the books all day. The workers copy the books word by word, but after each word they try to guess the next word that will appear. Eventually, they collectively become quite good at guessing the next word given a prompt text, even reproducing large swaths of text almost verbatim. The company's owner claims they owe nothing to the book owners, because it doesn't count as reading the books, and any reproduction is "coincidental" (even though this is the explicit task of the readers). They then use these workers to produce works that compete with the authors of the books, which they never paid for.

It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style. If you feel this is still fair use, then you should agree all books should be free to everyone (as well as art, code, music, and any other training material).

gruez 4 days ago | parent | next [-]

>but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style

Can you provide an example of someone being successfully sued for "mimicking style", presumably in the US judicial system?

snowe2010 3 days ago | parent | next [-]

> Second, the songs must share SUBSTANTIAL SIMILARITY, which means a listener can hear the songs side by side and tell the allegedly infringing song lifted, borrowed, or appropriated material from the original.

Music has had this happen numerous times in the US. The test isn’t an exact replica; it’s whether the song could be confused with the original.

George Harrison lost such a case over one of his songs, “My Sweet Lord.” There are many others.

https://ultimateclassicrock.com/george-harrison-my-sweet-lor...

program_whiz 4 days ago | parent | prev | next [-]

The damages arise from the very process of stealing material for training. The justification "yes but my training didn't cause me to directly copy the works" is faulty.

I won't rehash the many arguments as to why the output is also a violation, but my point was more about the absurd view that stealing and using all the data in the world isn't a problem because the output is a lossy encoding (even though the explicit training objective is to reproduce the training text / image).

Retric 4 days ago | parent | prev [-]

Style is an ambiguous term here, as it doesn’t directly map to what’s being considered. The case between “Blurred Lines” and “Got to Give It Up” is often considered one of style, and the Court of Appeals for the Ninth Circuit upheld the finding of copyright infringement.

However, AI has been shown to copy a lot more than what people consider style.

Dylan16807 4 days ago | parent | prev [-]

> In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own training.

That's called extreme overfitting. Proper training is supposed to give subtle nudges toward matching each source of text, and zillions of nudges slowly bring the whole thing into shape based on overall statistics and not any particular sources. (But that does require properly removing duplicate sources of very popular text, which seems to be an unsolved problem.)
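
Even the easy half of that dedup is just hashing; it's the near-duplicates that make it unsolved. A toy sketch (my own illustration, not anyone's actual pipeline):

    import hashlib

    def dedupe_exact(documents):
        # Drops byte-identical documents only. Near-duplicates -- quoted
        # excerpts, reformatted copies -- sail through, which is how very
        # popular text ends up nudging the weights many times over.
        seen, unique = set(), []
        for doc in documents:
            digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique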

So your analogy is far enough off that I can't give it a good reply.

> It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style.

I haven't seen anyone defend the piracy, and the piracy is what this settlement is about.

People are defending the training itself.

And I don't think anyone would seriously say the AI version is fair use but the human version isn't. You really think "many people" feel that way?

Retric 4 days ago | parent [-]

There isn’t a clear line for extreme overfitting here.

To generate working code, the output must follow the API exactly. Nothing separates code and natural language as far as the underlying algorithm is concerned.
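
As a quick illustration (tiktoken is just one example tokenizer):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # Code and prose both become flat sequences of token IDs; the
    # next-token objective treats the two identically.
    print(enc.encode("for i in range(10):"))
    print(enc.encode("It was a bright cold day in April."))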

Companies slightly randomize output to minimize the likelihood of direct reproduction of source material, but that’s independent of what the neural network is doing.

Dylan16807 4 days ago | parent [-]

You want different levels of fitting for different things, which is difficult. Tight fitting on grammar and APIs and idioms, loose fitting on creative text, and it's hard to classify it all up front. But still, if it can recite Harry Potter, that's not on purpose, and it's never trained to predict a specific source losslessly.

And it's not really about randomizing output. The model gives you a list of likely words, often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.
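
A rough sketch of what "picking one somehow" looks like (plain temperature sampling; the function and defaults are illustrative, not any vendor's implementation):

    import numpy as np

    def sample_next_token(logits, temperature=0.8, rng=None):
        # logits: one raw score per vocabulary entry for the next position.
        # There is no hidden "true" output being obfuscated; there is only
        # this distribution, and some token has to be picked from it.
        rng = rng or np.random.default_rng()
        scaled = (logits - np.max(logits)) / temperature  # stability shift
        probs = np.exp(scaled)
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

Lower temperature concentrates the choice on the most likely words; in the limit it's greedy decoding.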

Retric 4 days ago | parent [-]

> often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.

It’s very rare for multiple outputs to actually be tied such that the only choice is to pick one at random. Instead, it’s become accepted practice to make suboptimal choices for a few reasons, one of which really is to decrease the likelihood of reproducing existing text.

Nobody wants a headline like: “Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book” https://www.understandingai.org/p/metas-llama-31-can-recall-...

Dylan16807 4 days ago | parent [-]

I will say that picking the most likely word every single time isn't optimal.

Retric 4 days ago | parent [-]

I agree there are multiple reasons to slightly randomize output, but there are also downsides.