| ▲ | visarga 3 days ago |
| Copyright should be about copying rights, not statistical similarities. Similarity vs. causal link is a different standard altogether. |
|
| ▲ | gruez 3 days ago | parent | next [-] |
| > Copyright should be about copying rights, not statistical similarities. So you're agreeing with me? The courts have been pretty clear on what's copyrightable: copyright only protects specific expressions of an idea. You can copyright your specific writing of a recipe, but not the concept of the dish or the abstract instructions themselves. |
|
| ▲ | dotnet00 3 days ago | parent | prev | next [-] |
| Those statistical similarities originate from a copyright violation; there's your causal link. Basically the same as selling a game made using pirated Photoshop. |
| |
| ▲ | reissbaker 3 days ago | parent | next [-] | | Selling a game whose assets were made with a pirated copy of Photoshop does not extend Adobe's copyright to cover your game itself. They can sue you for using the pirated copy of Photoshop, but they can't extend copyright vampirically in that manner — at least, not in the United States. (They can still sue for damages, but they can't claim copyright over your game itself.) | | |
| ▲ | thaumasiotes 3 days ago | parent | next [-] | | Well, there are damages torts and there's also an unjust enrichment tort. In the paradigm example where you make funding available to your treasurer and he makes an unscheduled stop in Las Vegas to bet it on black, you can sue him for damages. If he lost the bet, he owes you the amount he lost. If he won, he owes you nothing (assuming he went on and deposited the full amount in your treasury as expected). Or you could sue him on a theory of unjust enrichment, in which case, if he lost, he'd owe you nothing, and if he won, he'd owe you all of his winnings. It's not clear to me why the same theory wouldn't be available to Adobe, though the copyright question wouldn't be the main thrust of the case then. | |
| ▲ | dotnet00 3 days ago | parent | prev | next [-] | | Are the authors claiming copyright over the LLM? My understanding is they were suing Anthropic for using the authors' data to train their product. The court ruled that using the books for training would be fair use, but that piracy is not fair use. Thus, isn't the settlement essentially Anthropic admitting that they don't really have an effective defense against the piracy claim? | |
| ▲ | reissbaker 2 days ago | parent [-] | | Oh I don't disagree that the authors may have a compelling case — I was just responding to the statistical-similarities-vs-copying argument. Anthropic may have violated the authors' rights, but technically that doesn't extend copyright via a "causal link." The authors can still sue for damages, though (and did, and had a strong enough case that Anthropic is trying to settle for over a billion dollars). |
| |
| ▲ | gowld 3 days ago | parent | prev [-] | | What is illegal about using pirated software that someone else distributed to you, if you never agreed to a license contract? | | |
| ▲ | dotnet00 3 days ago | parent [-] | | If you can show that the pirated copy was provided to you without your knowledge, and that there was no reasonable way for you to know that it was pirated, there probably isn't anything illegal about it for you. But otherwise, you're essentially asking if you can somehow bypass license agreements by simply refusing to read them, which would obviously render all licensing useless. | | |
| ▲ | thaumasiotes 3 days ago | parent [-] | | Why do you think reading the agreement is notionally mandatory before the software becomes functional? | | |
| ▲ | dotnet00 3 days ago | parent [-] | | Most paid software generally makes you acknowledge that you have read and accepted the terms of the license before first use, and includes a clause that continued use of the software constitutes acceptance of the license terms. In the event that you try to play games to get around that acknowledgement: courts aren't machines; they can tell that you're acting in bad faith to avoid license restrictions and can punish you appropriately. | |
| ▲ | thaumasiotes 3 days ago | parent [-] | | >> Why do you think reading the agreement is notionally mandatory before the software becomes functional? > Most paid software generally makes you acknowledge that you have read and accepted the terms of the license before first use, and includes a clause that continued use of the software constitutes acceptance of the license terms. Huh. If only I'd known that. Why do you think that is? | | |
| ▲ | dotnet00 3 days ago | parent [-] | | How about you directly say what you're trying to say instead of being unnecessarily sarcastic? | | |
| ▲ | thaumasiotes 3 days ago | parent [-] | | If it's possible to use software without agreeing to the license, then the license really doesn't bind the user. That's why people try to make it mandatory. Why did you think it made sense to respond to the question "Why do you think X is true?" with "Did you know that X is true?"? |
|
| ▲ | terminalshort 3 days ago | parent | prev [-] | | The statistical similarities originate from fair use, just as the judge ruled in this case. |
|
|
| ▲ | Retric 3 days ago | parent | prev [-] |
| The entire purpose of training materials is to copy aspects of them. That’s the causal link. |
| |
| ▲ | visarga 3 days ago | parent | next [-] | | > That's the causal link. But copyright was based on substantial similarity, not causal links. That is the subtle change: copyright is expanding more and more. In my view, unless there is substantial similarity to the infringed work, copyright should not be invoked. Even substantial similarity is itself an expansion of the original "protected expression" standard. It makes no sense to attack gen-AI for infringement: if we wanted the originals, we would get the originals; you can copy anything you like on the web. Generating bootleg Harry Potter is slow, expensive, and unfaithful to the original. We use gen-AI to create things different from the training data. | |
| ▲ | Retric 2 days ago | parent [-] | | Substantial similarity is less stringent than a causal link. With substantial similarity, the world's a minefield of unpopular media. Copyright isn't supposed to apply if you happen to write a story that bears an uncanny similarity to a story you never read, written in 1952 in a language you don't know, that sold 54 copies. |
| |
| ▲ | Dylan16807 3 days ago | parent | prev [-] | | The aspect it's supposed to copy is the statistics of how words work. And in general, when an LLM is able to recreate text, that's a training error. Recreating text is not the purpose. Which is not to excuse it happening, but the distinction matters. | |
| ▲ | program_whiz 3 days ago | parent [-] | | In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own training. The by-product of this training is that it nudges the model weights to make the text more likely to be produced by the model -- that is its explicit goal. A perfect model would be able to reproduce the text perfectly (0 loss). Real-world absurd example: A company hires a bunch of workers, gives them access to millions of books, and has the workers read the books all day. The workers copy the books word by word, but after each word they try to guess the next word that will appear. Eventually, they collectively become quite good at guessing the next word given a prompt text, even reproducing large swaths of text almost verbatim. The company's owner claims they owe nothing to the book owners, because it doesn't count as reading the books, and any reproduction is "coincidental" (even though reproduction is the explicit task of the readers). The owner then uses these workers to produce works that compete with the authors of the books, which the company never paid for. It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style. If you feel this is still fair use, then you should agree all books should be free to everyone (as well as art, code, music, and any other training material). | |
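For concreteness, here is a minimal sketch of the next-token objective described above, assuming a PyTorch-style autoregressive model; the function and variable names are illustrative, not anyone's actual training code:

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # tokens: a tokenized passage from the training corpus, shape (batch, length)
        inputs = tokens[:, :-1]     # the text so far
        targets = tokens[:, 1:]     # the word the author actually wrote next
        logits = model(inputs)      # the model's scores for every possible next word
        # cross-entropy only approaches 0 when the model puts probability ~1 on the
        # exact next word at every position, i.e. when it can replay the text verbatim
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
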
| ▲ | gruez 3 days ago | parent | next [-] | | >but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style Can you provide an example of someone being successfully sued for "mimicking style", presumably in the US judicial system? | | |
| ▲ | snowe2010 3 days ago | parent | next [-] | | > Second, the songs must share SUBSTANTIAL SIMILARITY, which means a listener can hear the songs side by side and tell the allegedly infringing song lifted, borrowed, or appropriated material from the original. Music has had this happen numerous times in the US. The distinction isn't an exact replica; it's whether the song could be confused for the same style. George Harrison lost a case over one of his songs. There are many others. https://ultimateclassicrock.com/george-harrison-my-sweet-lor... | |
| ▲ | program_whiz 3 days ago | parent | prev | next [-] | | The damages arise from the very process of stealing material for training. The justification "yes, but my training didn't cause me to directly copy the works" is faulty. I won't rehash the many arguments as to why the output is also a violation, but my point was more about the absurd view that stealing and using all the data in the world isn't a problem because the output is a lossy encoding (when the explicit training objective is to reproduce the training text / image). | |
| ▲ | Retric 3 days ago | parent | prev [-] | | Style is an ambiguous term here, as it doesn't directly map to what's being considered. The case between "Blurred Lines" and "Got to Give It Up" is often considered one of style, and the Court of Appeals for the Ninth Circuit upheld copyright infringement. However, AI has been shown to copy a lot more than what people consider style. |
| |
| ▲ | Dylan16807 3 days ago | parent | prev [-] | | > In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own training. That's called extreme overfitting. Proper training is supposed to give subtle nudges toward matching each source of text, and zillions of nudges slowly bring the whole thing into shape based on overall statistics and not any particular sources. (But that does require properly removing duplicate sources of very popular text, which seems to be an unsolved problem.) So your analogy is far enough off that I can't give it a good reply. > It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style. I haven't seen anyone defend the piracy, and the piracy is what this settlement is about. People are defending the training itself. And I don't think anyone would seriously say the AI version is fair use but the human version isn't. You really think "many people" feel that way? | |
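The "subtle nudges" above are roughly what a standard training loop does: each batch moves the weights only a small step, so no single source dominates unless it appears many times in the data. A rough sketch, reusing the hypothetical next_token_loss from the earlier snippet; the toy model and corpus here are stand-ins purely for illustration:

    import torch
    import torch.nn as nn

    # toy stand-in for a language model and a corpus of tokenized batches
    vocab_size = 1000
    model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
    corpus_batches = [torch.randint(0, vocab_size, (8, 128)) for _ in range(100)]

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for tokens in corpus_batches:                # batches drawn from many mixed sources
        loss = next_token_loss(model, tokens)    # the sketch from above
        optimizer.zero_grad()
        loss.backward()                          # one small gradient "nudge" per batch
        optimizer.step()
    # memorizing a particular book mostly happens when that book (or popular excerpts
    # of it) is duplicated many times in the corpus -- hence the deduplication problem
    # mentioned above
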
| ▲ | Retric 3 days ago | parent [-] | | There isn’t a clear line for extreme overfitting here. To generate working code the output must follow the API exactly. Nothing separates code and natural language as far as the underlying algorithm is concerned. Companies slightly randomize output to minimize the likelihood of direct reproduction of source material, but that’s independent of what the neural network is doing. | | |
| ▲ | Dylan16807 3 days ago | parent [-] | | You want different levels of fitting for different things, which is difficult. Tight fitting on grammar, APIs, and idioms; loose fitting on creative text; and it's hard to classify it all up front. But still, if it can recite Harry Potter that's not on purpose, and it's never trained to predict a specific source losslessly. And it's not really about randomizing output. The model gives you a list of likely words, often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it. | |
| ▲ | Retric 3 days ago | parent [-] | | > often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it. It's very rare for multiple outputs to actually be equal, leaving a random choice as the only option. Instead, it's become accepted practice to make suboptimal choices for a few reasons, one of which really is to decrease the likelihood of reproducing existing text. Nobody wants a headline like: "Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book" https://www.understandingai.org/p/metas-llama-31-can-recall-... | |
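As a rough illustration of what "pick one somehow" and "slightly randomize output" mean in practice: a common approach is temperature sampling over the model's word probabilities. This is a generic sketch, not any particular vendor's decoder:

    import torch

    def sample_next_token(logits, temperature=0.8):
        # logits: the model's raw scores for every word in the vocabulary
        # temperature < 1 sharpens the distribution, > 1 flattens it;
        # temperature -> 0 approaches greedy decoding (always the single top word)
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
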
| ▲ | Dylan16807 3 days ago | parent [-] | | I will say that picking the most likely word every single time isn't optimal. | | |
| ▲ | Retric 3 days ago | parent [-] | | I agree there’s multiple reasons to slightly randomize output, but there’s also downsides. |