visarga 3 days ago

How is it fair? Do you expect 9,000 from Google, Meta, OpenAI, and everyone else? Were your books imitated by AI?

Infringement was supposed to imply substantial similarity. Now it is supposed to mean statistical similarity?

jawns 3 days ago | parent | next [-]

You've misunderstood the case.

The suit isn't about Anthropic training its models using copyrighted materials. Courts have generally found that to be legal.

The suit is about Anthropic procuring those materials from a pirated dataset.

The infringement, in other words, happened at the time of procurement, not at the time of training.

If it had procured them from a legitimate source (e.g. licensed them from publishers) then the suit wouldn't be happening.

greensoap 3 days ago | parent | next [-]

A point of clarification and some questions.

The portion the court said was bad was not Anthropic getting books from pirated sites to train its model. The court opined that training the model was fair use and did not distinguish between getting the books from pirated sites or from hard-copy scans. The part the court said was bad, which was settled, was Anthropic getting books from a pirate site to store in a general-purpose library.

--

  "To summarize the analysis that now follows, the use of the books at issue to train Claude
  and its precursors was exceedingly transformative and was a fair use under Section 107 of the
  Copyright Act. And, the digitization of the books purchased in print form by Anthropic was
  also a fair use but not for the same reason as applies to the training copies. Instead, it was a
  fair use because all Anthropic did was replace the print copies it had purchased for its central
  library with more convenient space-saving and searchable digital copies for its central
  library — without adding new copies, creating new works, or redistributing existing copies.
  However, Anthropic had no entitlement to use pirated copies for its central library. Creating a
  permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy."

  "Because the legal issues differ between the *library copies* Anthropic purchased and
  pirated, this order takes them in turn."

--

Questions

As an author, do you think it matters where the book was copied from? Presumably, a copyright gives the author the right to control when a text is reproduced and distributed. If the AI company buys a book and scans it, they are reproducing the book without a license, correct? And fair use is the argument that even though they violated the copyright, they are excused. In a pure sense, if the AI company copied from a "pirate source" (assuming they didn't torrent the book back out), why is that copy worse than if they copied from a physical book?

8note 3 days ago | parent | next [-]

> AI company buys a book and scans it, they are reproducing the book without a license, correct

Isn't digitizing your own copies for backup and personal use fine? So long as you don't give away the original while keeping the backups. Similarly, don't give away the digital copies.

esrauch 3 days ago | parent [-]

It is, Google Books did it over a decade ago (bought up physical books and scanned them all). There were some rulings about how much of a snippet they were allowed to show end users as fair use, but I'm fairly sure the actual scanning and indexing of the books was always allowed.

cortesoft 3 days ago | parent | prev [-]

> If the AI company buys a book and scans it, they are reproducing the book without a license, correct?

No? I think there are a lot more details that need to be known before answering this question. It matters what they do with it after they scan it.

greensoap 3 days ago | parent [-]

That is only relevant to whether it is fair use, not to whether the copying is an infringement. Fair use is what is called an affirmative defense -- it means that yes, what I did was technically a violation, but it is forgiven. So on a technicality the copying is an infringement, but that infringement is "okay" because there is a fair use. A different scenario is if the copyright owner gives you a license to copy the work (like open source licenses). In that scenario the copying was not an infringement, because a license exists.

gpm 3 days ago | parent | next [-]

> Fair use is what is called an affirmative defense

Yes

> it means that yes what I did was technically a violation but is forgiven

Not at all. All "affirmative defence" means is that procedurally the burden is on me to establish that I was not violating the law. The law isn't "you can't do the thing"; rather it is "you can't do the thing unless it's like this". There is no violation and there is no forgiveness, as there is nothing to forgive, because it was done "like this" and doing it "like this" doesn't violate the law in the first place.

cortesoft 3 days ago | parent | prev [-]

If I have an app on my phone that lets me point my phone at a page to scan, OCR, and read the page out loud to me, it wouldn't even require fair use, would it?

mmargenot 3 days ago | parent | prev | next [-]

Do foundation model companies need to license these books or simply purchase them going forward?

sharkjacobs 3 days ago | parent | next [-]

> On June 23, 2025, the Court rendered its Order on Fair Use, Dkt. 231, granting Anthropic’s motion for summary judgment in part and denying its motion in part. The Court reached different conclusions regarding different sources of training data. It found that reproducing purchased and scanned books to train AI constituted fair use. Id. at 13-14, 30–31. However, the Court denied summary judgment on the copyright infringement claims related to the works Anthropic obtained from Library Genesis and Pirate Library Mirror. Id. at 19, 31.

https://www.documentcloud.org/documents/26084996-proposed-an...

> reproducing purchased and scanned books to train AI constituted fair use

greensoap 3 days ago | parent | next [-]

Actually, the court really only said downloading a pirated book to store in your "library" was bad. The opinion is intentionally? ambiguous on whether the decision regarding copies used to train an LLM applies only to scanned books or also to pirated books. The facts found in the case are that the training datasets were made from the "library" copies of books, which included both scans and pirated downloads. And the court said the training copies were fair use. The court also said the scanned library copies were fair use. The court found that the pirated library copies were not fair use. The court did not say for certain whether the pirated training copies were fair use.

thaumasiotes 3 days ago | parent | prev [-]

The usual analysis was that when you download a book from Library Genesis, that is an instance of copyright infringement committed by Library Genesis. This ruling appears to reverse that analysis.

papercrane 3 days ago | parent [-]

Do you have a source for that? Because MAI Systems Corp. v. Peak Computer, Inc. established that even creating a copy in RAM is considered a "copy" under the Copyright Act and can be infringement.

parineum 3 days ago | parent [-]

It's not an issue of where it's being copied; it's who's doing the copying.

Library Genesis has one copy. It then sends you one copy and keeps its own. The entity that violated the _copy_right is the one that copied it, not the one with the copy.

masfuerte 3 days ago | parent [-]

There are many copies made as the text travels from Library Genesis to Anthropic. This isn't just of theoretical interest. English law has specific copyright exemptions for transient copies made by internet routers, etc. It doesn't have exemptions for the transient copies made by end users such as Anthropic, and they are definitely infringing.

Of course, American law is different. But is it the case that copies made for the purpose of using illegally obtained works are not infringing?

thaumasiotes 3 days ago | parent [-]

> But is it the case that copies made for the purpose of using illegally obtained works are not infringing?

Well, the question here is "who made the copy?"

If you advertise in seedy locations that you will send Xeroxed copies of books by mail order, and I order one, and you then send me the copy I ordered, how many of us have committed a copyright violation?

masfuerte 3 days ago | parent [-]

Copyright law is literally about the copies. A xeroxed book is exactly one copy. Mailing and reading that book doesn't copy it any further. In contrast, you can't do anything with digital media without making another copy.

> "Who made the copy?"

This begs the question. With digital media everybody involved makes multiple copies.

bhickey 3 days ago | parent | prev [-]

Probably the latter.

gowld 3 days ago | parent | prev [-]

I thought that distribution of copyrighted materials was legally encumbered, not reception thereof.

lawlessone 3 days ago | parent | next [-]

Did they use a torrent? If they used a torrent isn't it likely they distributed it while downloading it?

gkbrk 3 days ago | parent [-]

Surely a state-of-the-art tech company would know how to disable seeding.

LeoPanthera 3 days ago | parent [-]

BitTorrent clients will not send data to clients which aren't uploading, as far as I know.

adrr 3 days ago | parent | prev | next [-]

Downloading is making a copy and is covered by copyright law. It's also subject to the statutory damages clause of up to $150k per violation if willful. I assume Anthropic knew they were using pirated books.

thayne 3 days ago | parent | prev [-]

Do you have a source for that? My understanding was that both were illegal, although of course media companies have an interest in making people believe that even if it isn't true.

wingspar 3 days ago | parent | prev | next [-]

My understanding is this settlement is about the MANNER in which Anthropic acquired the text of the books. They downloaded illegal copies of the books.

There were no issues with the physical copies of books they purchased and scanned.

I believe the issue of USING these texts for AI training is a separate issue/case(s).

Retric 3 days ago | parent | prev | next [-]

Penalties can be several times actual damages, and substantial similarity includes MP3 files and other lossy forms of compression which don’t directly look like the originals.

The entire point of deep learning is to copy aspects from training materials, which is why it’s unsurprising when you can reproduce substantial material from a copyrighted work given the right prompts. Proving damages for individual works in court is more expensive than the payout but that’s what class action lawsuits are for.

gruez 3 days ago | parent | prev [-]

>Were your books imitated by AI?

Given that books can be imitated by humans with no compensation, this isn't as strong an argument as you think. Moreover, AFAIK the training itself has been ruled legal, so Anthropic could have theoretically bought the book for $20 (or whatever) and been in the clear, which would obviously bring less revenue than the $9k settlement.

visarga 3 days ago | parent | next [-]

Copyright should be about copying rights, not statistical similarities. Similarity vs. causal link - a different standard altogether.

gruez 3 days ago | parent | next [-]

>Copyright should be about copying rights, not statistical similarities

So you're agreeing with me? The courts have been pretty clear on what's copyrightable. Copyright only protects specific expressions of an idea. You can copyright your specific writing of a recipe, but not the concept of the dish or the abstract instructions themselves.

dotnet00 3 days ago | parent | prev | next [-]

Those statistical similarities originate from a copyright violation; there's your causal link. Basically the same as selling a game made using pirated Photoshop.

reissbaker 3 days ago | parent | next [-]

Selling a game whose assets were made with a pirated copy of Photoshop does not extend Adobe's copyright to cover your game itself. They can sue you for using the pirated copy of Photoshop, but they can't extend copyright vampirically in that manner — at least, not in the United States.

(They can still sue for damages, but they can't claim copyright over your game itself.)

thaumasiotes 3 days ago | parent | next [-]

Well, there are damages torts and there's also an unjust enrichment tort. In the paradigm example where you make funding available to your treasurer and he makes an unscheduled stop in Las Vegas to bet it on black, you can sue him for damages. If he lost the bet, he owes you the amount he lost. If he won, he owes you nothing (assuming he went on and deposited the full amount in your treasury as expected).

Or you could sue him on a theory of unjust enrichment, in which case, if he lost, he'd owe you nothing, and if he won, he'd owe you all of his winnings.

It's not clear to me why the same theory wouldn't be available to Adobe, though the copyright question wouldn't be the main thrust of the case then.

dotnet00 3 days ago | parent | prev | next [-]

Are the authors claiming copyright over the LLM? My understanding is they were suing Anthropic for using the authors' works to train their product. The court ruled that using the books for training would be fair use, but that piracy is not fair use.

Thus, isn't the settlement essentially Anthropic admitting that they don't really have an effective defense against the piracy claim?

reissbaker 2 days ago | parent [-]

Oh I don't disagree that the authors may have a compelling case — I was just responding to the statistical similarities vs. copying argument. Anthropic may have violated the authors' rights, but technically that doesn't extend copyright via a "causal link."

The authors can still sue for damages though (and did, and had a strong enough case Anthropic is trying to settle for over a billion dollars).

gowld 3 days ago | parent | prev [-]

What is illegal about using pirated software that someone else distributed to you, if you never agreed to a license contract?

dotnet00 3 days ago | parent [-]

If you can show that the pirated copy was provided to you without your knowledge, and that there was no reasonable way for you to know that it was pirated, there probably isn't anything illegal about it for you.

But otherwise, you're essentially asking if you can somehow bypass license agreements by simply refusing to read them, which would obviously render all licensing useless.

thaumasiotes 3 days ago | parent [-]

Why do you think reading the agreement is notionally mandatory before the software becomes functional?

dotnet00 3 days ago | parent [-]

Most paid software generally makes you acknowledge that you have read and accepted the terms of the license before first use, and includes a clause that continued use of the software constitutes acceptance of the license terms.

In the event that you try to play games to get around that acknowledgement: Courts aren't machines, they can tell that you're acting in bad faith to avoid license restrictions and can punish you appropriately.

thaumasiotes 3 days ago | parent [-]

>> Why do you think reading the agreement is notionally mandatory before the software becomes functional?

> Most paid software generally makes you acknowledge that you have read and accepted the terms of the license before first use, and includes a clause that continued use of the software constitutes acceptance of the license terms.

Huh. If only I'd known that.

Why do you think that is?

dotnet00 3 days ago | parent [-]

How about you directly say what you're trying to say instead of being unnecessarily sarcastic?

thaumasiotes 3 days ago | parent [-]

If it's possible to use software without agreeing to the license, then the license really doesn't bind the user. That's why people try to make it mandatory.

Why did you think it made sense to respond to the question "Why do you think X is true?" with "Did you know that X is true?"?

terminalshort 3 days ago | parent | prev [-]

The statistical similarities originate from fair use, just as the judge ruled in this case.

Retric 3 days ago | parent | prev [-]

The entire purpose of training materials is to copy aspects of them. That’s the causal link.

visarga 3 days ago | parent | next [-]

> That’s the causal link.

But copyright was based on substantial similarity, not causal links. That is the subtle change. Copyright is expanding more and more.

In my view, unless there is substantial similarity to the infringed work, copyright should not be invoked.

Even the substantial similarity concept is already an expansion of the original "protected expression" standard.

It makes no sense to attack gen-AI for infringement. If we wanted the originals, we would get the originals; you can copy anything you like on the web. Generating bootleg Harry Potter is slow, expensive, and unfaithful to the original. We use gen-AI for creating things different from the training data.

Retric 2 days ago | parent [-]

Substantial similarity is less stringent than causal links. With substantial similarity, the world's a minefield of unpopular media.

Copyright isn't supposed to apply if you happen to write a story that bears an uncanny similarity to a story you never read, written in 1952 in a language you don't know, that sold 54 copies.

Dylan16807 3 days ago | parent | prev [-]

The aspect it's supposed to copy is the statistics of how words work.

And in general, when an LLM is able to recreate text, that's a training error. Recreating text is not the purpose. Which is not to excuse it happening, but the distinction matters.

program_whiz 3 days ago | parent [-]

In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own training. The by-product of this training is that it influences the model weights to make the text more likely to be produced by the model -- that is its explicit goal. A perfect model would be able to reproduce the text perfectly (0 loss).
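To make the objective concrete, here is a rough toy sketch of next-word-prediction training (my own illustration in PyTorch with made-up sizes, not anything from Anthropic's actual pipeline): the loss directly measures how well the model predicts each successive word of the training text, and the gradient step nudges the weights to make that text more likely.

  # Toy sketch of the next-word-prediction objective (assumed PyTorch
  # setup, hypothetical sizes -- purely illustrative).
  import torch
  import torch.nn as nn

  vocab_size, embed_dim = 1000, 64
  model = nn.Sequential(                   # crude stand-in for a real LLM
      nn.Embedding(vocab_size, embed_dim),
      nn.Linear(embed_dim, vocab_size),
  )

  tokens = torch.randint(0, vocab_size, (1, 128))  # pretend this is a book
  inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next word
                                                   # (a real LLM uses full context)

  logits = model(inputs)                           # shape (1, 127, vocab_size)
  loss = nn.functional.cross_entropy(
      logits.reshape(-1, vocab_size), targets.reshape(-1)
  )                                                # 0 loss = perfect reproduction
  loss.backward()  # nudges weights to make the training text more likely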

Real-world absurd example: A company hires a bunch of workers. They then give them access to millions of books and have the workers read the books all day. The workers copy the books word by word, but after each word try to guess the next word that will appear. Eventually, they collectively become quite good at guessing the next word given a prompt text, even reproducing large swaths of text almost verbatim. The owner of the company claims they owe nothing to the book owners, because it doesn't count as reading the book, and any reproduction is "coincidental" (even though this is the explicit task of the readers). They then use these workers to produce works to compete with the authors of the books, which they never paid for.

It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style. If you feel this is still fair use, then you should agree all books should be free to everyone (as well as art, code, music, and any other training material).

gruez 3 days ago | parent | next [-]

>but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style

Can you provide an example of someone being successfully sued for "mimicking style", presumably in the US judicial system?

snowe2010 3 days ago | parent | next [-]

> Second, the songs must share SUBSTANTIAL SIMILARITY, which means a listener can hear the songs side by side and tell the allegedly infringing song lifted, borrowed, or appropriated material from the original.

Music has had this happen numerous times in the US. The distinction isn't an exact replica; it's whether it could be confused for the same style.

George Harrison lost a case for one of his songs. There are many others.

https://ultimateclassicrock.com/george-harrison-my-sweet-lor...

program_whiz 3 days ago | parent | prev | next [-]

The damages arise from the very process of stealing material for training. The justification "yes but my training didn't cause me to directly copy the works" is faulty.

I won't rehash the many arguments as to why the output is also a violation, but my point was more the absurd view that stealing and using all the data in the world isn't a problem because the output is a lossy encoding (but the explicit training objective is to reproduce the training text / image).

Retric 3 days ago | parent | prev [-]

Style is an ambiguous term here, as it doesn't directly map to what's being considered. The case between "Blurred Lines" and "Got to Give It Up" is often considered one of style, and the Court of Appeals for the Ninth Circuit upheld the copyright infringement finding.

However, AI has been shown to copy a lot more than what people consider style.

Dylan16807 3 days ago | parent | prev [-]

> In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own trainings.

That's called extreme overfitting. Proper training is supposed to give subtle nudges toward matching each source of text, and zillions of nudges slowly bring the whole thing into shape based on overall statistics and not any particular sources. (But that does require properly removing duplicate sources of very popular text, which seems to be an unsolved problem.)

So your analogy is far enough off that I can't give it a good reply.

> It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style.

I haven't seen anyone defend the piracy, and the piracy is what this settlement is about.

People are defending the training itself.

And I don't think anyone would seriously say the AI version is fair use but the human version isn't. You really think "many people" feel that way?

Retric 3 days ago | parent [-]

There isn’t a clear line for extreme overfitting here.

To generate working code the output must follow the API exactly. Nothing separates code and natural language as far as the underlying algorithm is concerned.

Companies slightly randomize output to minimize the likelihood of direct reproduction of source material, but that’s independent of what the neural network is doing.

Dylan16807 3 days ago | parent [-]

You want different levels of fitting for different things, which is difficult. Tight fitting on grammar and APIs and idioms, loose fitting on creative text, and it's hard to classify it all up front. But still, if it can recite Harry Potter, that's not on purpose, and it's never trained to predict a specific source losslessly.

And it's not really about randomizing output. The model gives you a list of likely words, often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.
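For what it's worth, a bare-bones sketch of that "pick one somehow" step (my own illustration with made-up scores; real samplers add top-k/top-p and more): the model hands back a score per candidate word, and the sampler turns those scores into probabilities and draws one, optionally sharpened or flattened by a temperature.

  # Minimal next-word sampling sketch -- illustrative only.
  import math, random

  def sample_next_word(scores, temperature=1.0):
      # scores: dict of candidate word -> model score (higher = more likely)
      scaled = {w: s / temperature for w, s in scores.items()}
      m = max(scaled.values())                      # for numerical stability
      weights = {w: math.exp(s - m) for w, s in scaled.items()}
      total = sum(weights.values())
      r, acc = random.random() * total, 0.0
      for word, weight in weights.items():
          acc += weight
          if acc >= r:
              return word
      return word  # floating-point fallback

  # Three near-tied candidates: there is no single "real" answer being hidden.
  print(sample_next_word({"the": 2.1, "a": 2.0, "his": 1.9}, temperature=0.8))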

Retric 3 days ago | parent [-]

> often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.

It’s very rare for multiple outputs to actually be equal, so the only choice is to choose one at random. Instead it’s become accepted practice to make suboptimal choices for a few reasons, one of which really is to decrease the likelihood of reproducing existing text.

Nobody wants a headline like: “Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book” https://www.understandingai.org/p/metas-llama-31-can-recall-...

Dylan16807 3 days ago | parent [-]

I will say that picking the most likely word every single time isn't optimal.

Retric 3 days ago | parent [-]

I agree there are multiple reasons to slightly randomize output, but there are also downsides.

arduanika 3 days ago | parent | prev [-]

Machines aren't people.

gruez 3 days ago | parent [-]

They're not, but that's a red herring given that humans vs. machines is not a relevant factor in current copyright statutes or case law. Short of new laws being passed or activist judges ruling otherwise, it'll remain this way.

snowe2010 3 days ago | parent [-]

But whether or not it is a machine _is_ relevant in current copyright law. https://constitutioncenter.org/blog/federal-court-rules-arti...