advael 3 days ago

I think there's no meaningful case by the letter of the law that use of training data that include GPL-licensed software in models that comprise the core component of modern LLMs doesn't obligate every producer of such models to make both the models and the software stack supporting them available under the same terms. Of course, it also seems clear in the present landscape that the law often depends more on the convenience of the powerful than its actual construction and intent, but I would love to be proven wrong about that, and this kind of outcome would help.

tpmoney 3 days ago | parent | next [-]

> I think there's no meaningful case by the letter of the law that use of training data that include GPL-licensed software in models that comprise the core component of modern LLMs doesn't obligate every producer of such models to make both the models and the software stack supporting them available under the same terms.

Why do you think "fair use" doesn't apply in this case? The prior Bartz vs Anthropic ruling laid out pretty clearly how training an AI model falls within the realm of fair use. Authors Guild vs Google and Authors Guild vs HathiTrust were both decided much earlier and both found that digitizing copyrighted works for the sake of making them searchable is sufficiently transformative to meet the standards of fair use. So what is it about GPL licensed software that you feel would make AI training on it not subject to the same copyright and fair use considerations that apply to books?

shakna 3 days ago | parent | next [-]

Bartz v Anthropic explicitly withheld ruling on fair use. It is not precedent, here.

derektank 2 days ago | parent [-]

I’m not a lawyer, but I read the decision, and how is this section not a ruling on fair use?

“To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library. Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.”

Or in the final judgement, “This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason.”

shakna 2 days ago | parent [-]

There's two parts here.

The first:

> it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library

It is only fair use where Anthropic had already purchased a license to the work. Which has zero to do with scraping - a purchase was made, an exchange of value, and that comes with rights.

The second, which involves a section of the judgement a little before your quote:

> And, as for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic.

This is where the court refused to make any ruling. There was no exchange of value here, such as would happen with scraping. The court made no ruling.

tpmoney 2 days ago | parent [-]

I believe you are misinterpreting the ruling. Remember that a copyright claim must inherently argue that copies of the work are being made. To that end, the case analyzes multiple "copies" alleged to have been made.

1) "Copies used to train specific LLMs", for which the ruling is:

> The copies used to train specific LLMs were justified as a fair use.

> Every factor but the nature of the copyrighted work favors this result.

> The technology at issue was among the most transformative many of us will see in our lifetimes.

Notable here is that all of the "copies used to train specific LLMs" were copies made from books Anthropic purchased. But also of note is that Anthropic need not have purchased them, as long as they had obtained the original sources legally. The case references the Google Books lawsuit as an example of something Anthropic could have done to avoid pirating the books they did pirate, wherein Google obtained the original materials on loan from willing and participating libraries, and did not purchase them.

2) "The copies used to convert purchased print library copies into digital library copies", where again the ruling is:

> justified, too, though for a different fair use. The first factor strongly favors this result, and the third favors it, too. The fourth is neutral. Only the second slightly disfavors it. On balance, as the purchased print copy was destroyed and its digital replacement not redistributed, this was a fair use.

Here one might argue that the use of GPL code is different, in that in making the copy, no original was destroyed. But it's also very likely that this wouldn't apply at all in the case of GPL code, because there was no original physical copy to convert into a digital format. The code was already digitally available.

3) "The downloaded pirated copies used to build a central library" where the court finds clearly against fair use.

4) "And, as for any copies made from central library copies but not used for training" where, as you note, Judge Alsup declined to rule. But notice particularly that this is referring to copies made FROM the central library AND NOT for the purposes of training an LLM. The copies made from purchased materials to build the central library in the first place were already deemed fair use. And making copies from the central library to train an LLM from those copies was also determined to be fair use. The copies obtained by piracy were not. But for uses not pertaining to the training of an LLM, the judge declined to make a ruling here because there was not enough evidence about what books from the central library were copied for what purposes, and what the source of those copies was. As he says in the ruling:

> Anthropic is not entitled to an order blessing all copying “that Anthropic has ever made after obtaining the data,” to use its words

This declination applies both to the purchased and pirated sources, because it's about whether making additional copies from your central library copies (which themselves may or may not have been fair use), automatically qualifies as fair use. And this is perfectly reasonable. You have a right as part of fair use to make a copy of a TV broadcast to watch at a later time on your DVR. But having a right to make that copy does not inherently mean that you also have a right to make a copy from that copy for any other purposes. You may (and almost certainly do) have a right to make a copy to move it from your DVR to some other storage medium. You may not (and almost certainly do not) have a right to make a copy and give it to your friend.

At best, an argument that GPL software wouldn't be covered under the same considerations of fair use that this case considers would require arguing that the copies of GPL code obtained by Anthropic were not obtained legally. But that's likely going to be a very hard argument to make given that GPL code is freely distributed all over the place with no attempts made to restrict who can access that code. In fact, GPL code demands that if you distribute the software derived from that code, you MUST make copies of the code available to anyone you distribute the software to. Any AI trainer would simply need to download Linux or emacs and the GPL requires the person they downloaded that software from to provide them with the source code. How could you then argue that the original source from which copies were made was obtained illicitly when the terms of downloading the freely available software mandated that they be given a copy?

shakna 2 days ago | parent [-]

> How could you then argue that the original source from which copies were made was obtained illicitly when the terms of downloading the freely available software mandated that they be given a copy?

By the license and terms such copies are under.

> For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

You _must_ show the terms. If you copy the GPL code, and it inherits the license, as the terms say it does, then you must also copy the license.

The GPL does not give you an unfettered right to copy, it comes with terms and conditions protecting it under contract law. Thus, you must follow the contract.

The GPL goes to some lengths to define its terms.

> A "covered work" means either the unmodified Program or a work based on the Program.

> Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

It is not just the source code that you must convey.

tpmoney 21 hours ago | parent [-]

> By the license and terms such copies are under.

Which clause of the GPL requires the receiver of GPL code to agree to the terms of the GPL before being allowed to receive the source code that they are entitled to under the license? Because that would expressly contradict the first sentence of section 9:

    You are not required to accept this License in order to receive or run a copy of the Program.
Isn't that one of the key points to the GPL? That the provisions of it only apply to you IF you decide to distribute GPL software but that they do not impose any restrictions on the users of the software? Surely you're not suggesting that anyone who has ever seen the source code of a GPLed piece of software is permanently barred from contributing to or writing similar software under a non-GPL license simply by the fact that they received the GPLed source code.

> If you copy the GPL code, and it inherits the license, as the terms say it does, then you must also copy the license.

> The GPL does not give you an unfettered right to copy, it comes with terms and conditions protecting it under contract law. Thus, you must follow the contract.

I agree that the GPL does not give you an unfettered right to copy. But the GPL, like all such licenses, is still governed by copyright law. And "fair use" is an exception to copyright law that allows you to make copies you are not otherwise authorized to make. No publisher can put additional terms in their book, even wrapped in shrinkwrap, that deny you the right to use that book for fair use purposes like quoting it for criticism or parody. The Sony terms and conditions for the PlayStation very clearly forbid copying the BIOS or decompiling it. But those terms are null and void when you copy the BIOS and decompile it to make a new emulator (at least in the US), because the courts have already ruled that such use is fair use.

So it is with the GPL. By default you have no right to make copies of the software at all. The GPL then grants you additional rights you normally wouldn't have under copyright law, but only to the extent that when exercising those rights, you comply with the terms of the GPL. But "Fair Use" then goes beyond that and says that for certain purposes, certain types and amounts of copies can be made, regardless of what rights the publisher does or does not reserve. This would be why the GPL specifically says:

    This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.
Fair use (and its analogs in other countries) supersede the GPL. And even the GPL FAQ[1] acknowledges this fact:

    Do I have “fair use” rights in using the source code of a GPL-covered program? (#GPLFairUse)
    Yes, you do. “Fair use” is use that is allowed without any special 
    permission. Since you don't need the developers' permission for such use, you 
    can do it regardless of what the developers said about it—in the license or 
    elsewhere, whether that license be the GNU GPL or any other free software 
    license.
[1]: https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse
ronsor 3 days ago | parent | prev | next [-]

> So what is it about GPL licensed software that you feel would make AI training on it not subject to the same copyright and fair use considerations that apply to books?

The poster doesn't like it, so it's different. Most of the "legal analysis" and "foregone conclusions" in these types of discussions are vibes dressed up as objective declarations.

input_sh 3 days ago | parent [-]

You seem like the type of person who will believe anything as long as someone cites a case without looking into it. Bartz v Anthropic only looked at books, and there was still a $1.5 billion settlement that Anthropic paid out because it got those books from LibGen / Anna's Archive, and the ruling also said that the data has to be acquired "legitimately".

Whether data acquired from a licence that specifically forbids building a derivative work without also releasing that derivative under the same licence counts as a legitimate data gathering operation is anyone's guess, as those specific circumstances are about as far from that prior case as they can be.

eru 3 days ago | parent | next [-]

As long as they don't distribute the model's weights, even a strict interpretation of the GPL should be fine. Same reason Google doesn't have to upstream changes to the Linux kernel they only deploy in-house.

oblio 2 days ago | parent | next [-]

But LLMs do distribute the derived code they generate outside of their company. That's their entire point.

Akronymus 2 days ago | parent [-]

But wouldn't that be like some company using gpl licensed code to host a code generator for something? At least in a legal interpretation. Or is that different?

oblio 2 days ago | parent | next [-]

And why would that be different or allowed? Sure, you get all the code you want, GPL licensed.

Everybody is trying to have their cake and eat it, too, by license laundering.

Heck, money laundering means you at least lose some of the money.

Akronymus 2 days ago | parent [-]

I have no idea. I genuinely was asking out of curiosity what the law actually means for that, while speculating.

advael 2 days ago | parent | prev [-]

I mean, is the case you're making that you can run a SaaS business on GPL-derived code without fulfilling GPL obligations because you're not distributing a binary?

eru 2 days ago | parent | next [-]

Yes, that's exactly what people do and did. That 'loophole' is the whole reason people came up with https://en.wikipedia.org/wiki/GNU_Affero_General_Public_Lice...

Akronymus 2 days ago | parent | prev [-]

I guess I am. I genuinely am just a layperson trying to look at what the law would say, so everything is speculation.

advael 2 days ago | parent [-]

If true, that would seem to invalidate the entire GPL. But even by that logic, a website (such as ChatGPT) distributes JavaScript that runs the code, and programs like Claude Code do so as well. Again, if you can slip the GPL's requirements through indirection, like having your application phone home to your server to go get the infringing parts, the GPL would be essentially unenforceable in... most contexts.

fragmede 2 days ago | parent [-]

That's where the AGPL comes in. The GPL(v2) does not require eg Google or Facebook to release any of the changes they've made to the Linux kernel. That they do so is not because of a legal obligation to do so. The "to get parts" thing is the relevant detail to be very specific on. If those parts are a binary that is used, then the GPL does kick in, but for distributing source code that's possibly derived, possibly not covered by copyright, it's not been decided in a court of law yet.

fsflover 2 days ago | parent | prev [-]

How about AGPL?

eru 2 days ago | parent [-]

Sure, that one was specifically designed to close that loophole.

ronsor 2 days ago | parent | prev [-]

Have you actually read the text of the GPL?

> This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.

It is legitimate to acquire GPL software. The requirements of the license only occur if you're distributing the work AND fair use does not apply.

Training certainly doesn't count as distribution, so the buck passes to inference, which leaves us dealing with substantial similarity test, and still, fair use.

apatheticonion 2 days ago | parent | next [-]

There is the clean room problem though.

If a human reads GPL code and outputs a recreation of that code (derivative work) using what they learned - that is illegal.

If an AI reads GPL code and outputs a recreation of that code using what it "learned" - it's not illegal?

If that is the case, then copyright holds no weight any more. I should be allowed to train an LLM on decompiled firmware (say, Playstation, Switch, iPhone) in countries where decompilation is legal - then have the LLM produce equivalent firmware that I later use to build an emulator (or competing open source firmware).

tpmoney 2 days ago | parent [-]

> If that is the case, then copyright holds no weight any more. I should be allowed to train an LLM on decompiled firmware (say, Playstation, Switch, iPhone) in countries where decompilation is legal - then have the LLM produce equivalent firmware that I later use to build an emulator (or competing open source firmware).

It's funny you mention that, because one of the biggest fair use cases that effectively cemented "fair use" for emulators is Sony Computer Entertainment Inc v. Connectix Corp.[1], where the copying of PlayStation BIOS files for the purposes of reverse engineering and creating an emulator was explicitly ruled to be fair use, including running that code through a disassembler.

[1]: https://en.wikipedia.org/wiki/Sony_Computer_Entertainment,_I....

input_sh 2 days ago | parent | prev [-]

You and I are not a fucking judge, our opinions on this don't matter one bit. We might as well print it on a piece of paper and wipe our asses with it.

jerf 2 days ago | parent | prev | next [-]

You sound like you're citing the general Internet understanding of "fair use", which seems to amount to "I can do whatever I like to any copyrighted content as long as maybe I mutilate it enough and shout 'FAIR USE!' loudly enough."

On the real measures of "fair use", at least in the US: https://fairuse.stanford.edu/overview/fair-use/four-factors/ I would contend that it absolutely face plants on all four measures. The purpose is absolutely in the form of a "replacement" for the original, the nature is something that has been abundantly proved many times over in court as being something copyrightable as a creative expression (with limited exceptions for particular bits of code that are informational), the "amount and substantiality" of the portions used is "all of it", and the effect of use is devastating to the market value of the original.

You may disagree. A long comment thread may ensue. However, all I really need for my point here is simply that it is far, far from obvious that waving the term "FAIR USE!" around is a sufficient defense. It would be a lengthy court case, not a slam-dunk "well duh it's obvious this is fair use". The real "fair use" and not the internet's "FAIR USE!" bear little resemblance to each other.

A sibling comment mentions Bartz v. Anthropic. Looking more at the details of the case I don't think it's obvious how to apply it, other than as a proof that just because an AI company acquired some material in "some manner" doesn't mean they can just do whatever with it. The case ruled they still had to buy a copy. I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2". It's a somewhat lengthy road to travel, where each step could result in a failure, but the same can be said for the road to "just because I can lay my hands on it means I can feed it to my AI and 100% own the result" and that has already had a step fail.

jrm4 2 days ago | parent | next [-]

"Real" fair use is perhaps one of the most nebulous legal concepts possible. I haven't dived deep into software, but a cursory look at how it "works" (I use that term as loosely as possible) in music, with sampling and interpolation etc., immediately reveals that there's just about nothing one can rely on in any logical sense.

tpmoney 2 days ago | parent | prev [-]

I'm not really sure why you think my comment specifically citing the recent rulings by Judge Alsup and also the prior history with respect to the Google Books project is somehow declaring "I can do whatever I like to any copyrighted content", but I assure you I'm not. I'm very specifically talking about the various cases that have come about in the digital age dealing with fair use as it has been interpreted by US courts to apply to the use of computers to create copies of works for the purposes of creating other works.

I'm referring to the long history of carefully threaded fair use rulings and settlements, many of which we as an industry have benefitted greatly from. From determinations that cloning a BIOS can be fair use (see IBM PC bios cloning, but also Sony v. Connectix), or that cloning an entire API for the purposes of creating a parallel competitive product (Google v. Oracle), or digitizing books for the purposes of making those books searchable and even displaying portions of those books to users (Authors Guild v. Google) or even your cable company offering you "remote DVR" copying of broadcast TV (20th Century Fox v. Cablevision). Time and again the courts have found that copyright, and especially copyright with respect to digital transformations is far more limited than large corporations would prefer. Further they have found in plenty of cases that even a direct 1:1 copy of source can be fair use, let alone copies which are "transformative" as LLM training was found to be in Bartz.

Realistically, I don't see how anyone can have watched the various copyright cases that have been decided in the digital age, and seen the battles that the EFF (and a good part of the tech industry) have waged to reduce the strength of copyright and not also see how AI training can very easily fit within that same framework.

Not to cast aspersions on my fellow geeks and nerds, but it has been very interesting to me to watch the "hacker" world move from "information wants to be free" to "copyright maximalists" once it was their works that were being copied in ways they didn't like. For an industry that has brought about (and heavily promoted and supported) things like DeCSS, BitTorrent, Handbrake, Jellyfin/Plex, numerous emulators, WINE, BIOS and hardware cloning, ad blockers, web scrapers and many other things that copyright owners have been very unhappy about, it's very strange to see this newfound respect for the sanctity of copyright.

> I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2".

And I would argue that obtaining a legal copy of the GPL source to a program requires no such agreement. By downloading a copy of a GPLed program, I am entitled by the terms under which that software was distributed to obtain a copy of the source code. I do not have to agree to any other terms in order to obtain that source code; downloading from someone authorized to distribute that code is in and of itself sufficient to entitle me to it. You cannot, by the very terms of the GPL itself, deny me a copy of the source code for GPL software you have distributed to me, even if you believe I intend to make distributions that are not GPL compliant. You can decline to distribute the software to me in the first place, but once you have distributed it to me, I am legally entitled to a copy of the source code. From there, now that I have a legal copy, the question becomes: is making additional copies for the purposes of training an AI model fair use? So far, the most definitive case we have on the matter (Bartz) says yes it is.

So either we have to make the case that the original copy was somehow acquired from a source not authorized to make that copy, or we have to argue that the output of the AI model or the AI model is itself infringing. Given the ruling that copies made for training an AI model was ruled "exceedingly transformative and was a fair use under Section 107 of the Copyright Act"[1] it seems unlikely that the AI model itself is going to be found to be infringing. That leaves the output of the model itself, which Bartz does not rule on, as the authors never alleged the output of the model was infringing. GPL software authors might be able to prevail on that point, but they would have a pretty uphill battle I think in demonstrating that the model generated infringing output and not simply functional necessary code that isn't covered by copyright. The ability of code to be subject to copyright has long been a sort of careful balance between protecting a larger creative idea, and also not simply walling off whole avenues of purely functional decisions from all competitors.

[1]: https://admin.bakerlaw.com/wp-content/uploads/2025/07/ECF-23...

advael 3 days ago | parent | prev [-]

Broadly speaking, GPL is a license that has specific provisions about creating derivative software from the licensed work, and just saying "fair use" doesn't exempt you from those provisions. More specifically, an advertised use case (in fact, arguably the main one at this stage) of the most popular closed models as they're currently being used is to produce code, some of which is going to be GPL licensed. As such, the code used is part of the functionality of the program. The fact that this program was produced from the source code used by a machine learning algorithm rather than some other method doesn't change this fundamental fact.

The current supreme court may think that machine learning is some sort of magic exception, but they also seem to believe whatever oligarchs will bribe them to believe. Again, I doubt the law will be enforced as written, but that has more to do with corruption than any meaningful legal theory. Arguments against this claim seem to ignore that courts have already ruled these systems to not have intellectual property rights of their own, and the argument for fair use seems to rely pretty heavily on some handwavey anthropomorphization of the models.

mr_toad 2 days ago | parent [-]

> Broadly speaking, GPL is a license that has specific provisions about creating derivative software from the licensed work, and just saying "fair use" doesn't exempt you from those provisions.

Broadly speaking, yes it does. The whole point of fair use is that you don’t need a license.

davemp 2 days ago | parent [-]

Claiming LLMs are fair use is ridiculous bordering on ignorant or disingenuous.

Here’s the 4 part test from 17 U.S.C. § 107:

1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

Fail. The use is to make trillions of dollars and be maximally disruptive.

2. the nature of the copyrighted work;

Fail. In many cases at least, the copy written code is commercial or otherwise supports livelihoods, and is the result of much high-skill labor with the express stipulation of reciprocity.

3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

Fail. They use all of it.

4. the effect of the use upon the potential market for or value of the copyrighted work.

Fail to the extreme. There is already measurable decline in these markets. The leaders explicitly state that they want to put knowledge workers out of business.

- - -

Hell, LLMs don’t even pass the sniff test.

The only reason this stuff is being entertained is some combination of the prisoner’s dilemma and more classic greed.

cxr 2 days ago | parent | next [-]

This comment highlights a basic dilemma about how and where to spend your time.

Here's a basic rule of thumb I recommend people apply when it comes to these sorts of long, contentious threads where you know that not every person showing up to the conversation is limiting themselves to commenting about things they understand and that involve some of the most tortured motivated reasoning about legal topics:

If the topic is copyright and someone who is speaking authoritatively has just used the words "copy written", then ignore them. Consider whether you need to be anywhere in the conversation at all, even as a purely passive observer. Think about all the things you can do instead of wasting your time here, where the stakes for participation are so low because nothing that is said here really matters. Go do something productive.

2 days ago | parent | next [-]
[deleted]
davemp 2 days ago | parent | prev [-]

Yet you still wasted your own time and everyone else’s time with a reply that has even less substance.

I was making an argument based on quotes from the actual legal code, and you're saying peons who don't use the exact correct terminology shouldn't even consider what should or shouldn't be legal? What a load of junk. This is a democracy. We're supposed to be engaging with it.

luma 2 days ago | parent | prev | next [-]

You’re mixing up “using” with “copying”. You are allowed to “use” all of a book or movie or code by listening to or watching or reviewing the whole thing. Copyright protects copies. The legal claim here is that training an LLM is sufficiently transformative such that it cannot be construed as a copy.

davemp 2 days ago | parent [-]

I replied to someone saying that it’s fair use, which presupposes that it’s a derivative work.

joshuacc 2 days ago | parent | prev | next [-]

These are factors to be considered, not pass/fail questions.

tpmoney 2 days ago | parent | prev [-]

> Fail. The use is to make trillions of dollars and be maximally disruptive.

Fair use has repeatedly been found even in cases where the copies were used for commercial purposes. See Sony v. Connectix for example, where the cloning and disassembly of the PlayStation BIOS for the purposes of making a commercially sold (at retail, in a box) emulator of a then currently sold game console was determined to be fair use.

> Fail. In many cases at least, the copy written code is commercial or otherwise supports livelihoods; and is the result much high skill labor with the express stipulation for reciprocity.

Again, see Sony V. Connectix where the sales of PlayStation consoles support the livelihoods and skilled labor of Sony engineers.

> Fail. They use all of it.

And again, see Sony V. Connectix, where the entire BIOS was copied again and again until a clone could be written that sought to reproduce all the functionality of the real BIOS. Or see Google V. Oracle where cloning the entire Java API for a competing commercial product was also deemed fair use. Or the Google Books lawsuits, where cloning entire books for the purposes of making them searchable online was deemed fair use. Or see any of the various time/format shifting cases over the years (Cassette tapes, VCRs, DVRs, MP3 encoders, DVD ripping etc) where making whole and complete copies of works is deemed fair use.

> Fail to the extreme. There is already measurable decline in these markets. The leaders explicitly state that they want to put knowledge workers out of business.

Again, see Sony v. Connectix where the commercial product deemed to be fair use was directly competing with an actively sold video game console. Copyright protects the rights of creators to exploit their own works, it does not protect them against any and all forms of competition.

Or perhaps instead of referring you to the history of legislation around copyright in the digital age, I should instead simply point you at Judge Alsup's ruling in the Bartz case where he details exactly why the facts of the case and prior case law find that training an AI on copyrighted material is fair use [1]. Of particular interest to you might be the fact that each of the 4 factors is not a simple "pass/fail" metric, but a weighing of relative merits. For example, when examining factor 1, Judge Alsup writes:

> That the accused is a commercial entity is indicative, not dispositive. That the accused stands to benefit is likewise indicative. But what matters most is whether the format change exploits anything the Copyright Act reserves to the copyright owner.

[1]: https://admin.bakerlaw.com/wp-content/uploads/2025/07/ECF-23...

davemp a day ago | parent [-]

I appreciate the detailed reply and that there’s subtlety here.

I read the linked Bartz case. It’s disappointing that it seems limited to only the copying of books into a data set and not the result of training an LLM on protected works. This is not the “use” that I was discussing and not very interesting.

The plaintiffs didn’t even challenge that the outputs of the LLMs infringe. The judge seems to agree (at least by omission) that fair use wouldn’t apply to the outputs, but rather that they were transformative, and in cases where they weren’t:

> [anthropic] placed additional software between the user and the underlying LLM to ensure that no infringing output ever reached the users.

So this is not true:

> he [the judge] details exactly why the facts of the case and prior case law find that training an AI on copyrighted material is fair use

The plaintiffs also make really awful arguments about “memorizing” and “learning” that falsely anthropomorphize LLMs. Which the judge shoots down.

If we’re going to give LLMs the same rights as humans, there’s unlikely to be much of an argument.

I think there’s potential for an argument about how LLMs use “compressed” versions of protected works to _mechanically_ traverse language space. It would be subtle and technical so maybe not likely to work in our current context.

tpmoney 21 hours ago | parent [-]

> It’s disappointing that it seems limited to only the copying of books into a data set and not the result of training an LLM on protected works. This is not the “use” that I was discussing and not very interesting.

I agree that a ruling on the outputs specifically would have been interesting and instructive, but I disagree with the interpretation that by omission fair use would not apply to those outputs. The outputs were not challenged, as the judge notes, because the plaintiffs did not allege the outputs of the AI were infringing. The only conclusion we can really draw from this is that the plaintiffs didn't think they could make a good case for the outputs being infringing. Maybe GPL software authors could do so, but clearly these book authors did not think they could. Judge Alsup does note that it's certainly possible for those outputs to be infringing, but that such a case would have to be litigated separately.

And again, this all makes sense to me if you've followed copyright law through the digital age. A xerox machine can be used to create verbatim, clearly infringing copies of works covered by copyright. But that being the case does not mean that making a xerox machine is a violation of copyright, even if you use copyrighted material to test the machine. It does not mean that selling a xerox machine is a violation of copyright, even if you use copyrighted material to demonstrate the capabilities when selling the machine. And it does not mean that every use of a xerox machine is inherently a copyright violation, even if any individual use can be.

Similarly, consider CD ripping software (like iTunes) or DVD/BluRay ripping software like Handbrake. I would be comfortable betting that over 90% of all copies made by iTunes or Handbrake are copies of works that the copy maker does not own copyright to (remember the "Rip, Mix, Burn" iTunes commercials?). But that being the case, iTunes' CD ripping capabilities and Handbrake's DVD ripping capabilities are not themselves copyright violations, nor is distributing that software, even with instructions for how the end user can use that software to make copies of material that they do not own the copyright for. That this software can enable piracy on a mass scale does not inherently make every use of the software a copyright violation. Whether or not the output of iTunes or Handbrake is "fair use" is and must be litigated on an individual basis. The output is not inherently one or the other.

> The plaintiffs also make really awful arguments about “memorizing” and “learning” that falsely anthropomorphize LLMs. Which the judge shoots down.

> If we’re going to give LLMs the same rights as humans, there’s unlikely to be much of an argument.

Judge Alsup goes much further than just "shoot[ing] down" the arguments about memorizing and learning, he also very explicitly says right on page 9:

    To summarize the analysis that now follows, the use of the books at issue to train Claude
    and its precursors was exceedingly transformative and was a fair use under Section 107 of the
    Copyright Act.
and later:

    In short, the purpose and character of using copyrighted works to train LLMs to generate
    new text was quintessentially transformative. Like any reader aspiring to be a writer,
    Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but
    to turn a hard corner and create something different. If this training process reasonably
    required making copies within the LLM or otherwise, those copies were engaged in a
    transformative use.
apatheticonion 3 days ago | parent | prev | next [-]

I'm struggling to parse the double negative in that statement, haha.

Are you saying you believe that, untested but technically, models trained on GPL sources would need to be distributed under the GPL?

advael 3 days ago | parent | next [-]

Yes. Double negative intended for emphasis here, but apologies if it's confusing

eru 3 days ago | parent [-]

Well, most companies never distribute their models. So GPL doesn't kick in.

vova_hn2 2 days ago | parent [-]

I think that the claim that they make is that once a model is "contaminated" with GPL code, every output it ever produces should be considered derived from GPL code, therefore GPL-licensed as well.

Tadpole9181 2 days ago | parent [-]

So GitHub and Windows and IDEs need to be open source because they can output FOSS code? That's obviously ridiculous.

If an AI outputs copyrighted code, that is a copyright violation. And if it does and a human uses it, then you are welcome to sue the human or LLM provider for that. But you don't get to sue people for perceived "latent" thought crimes.

vova_hn2 2 days ago | parent | next [-]

First of all, I'm not advocating for this claim, I'm merely trying to clarify what other people say.

That being said, I don't think that your analogy is valid in this case.

> GitHub and Windows and IDEs need to be open source because they can output FOSS code

They can output FOSS code, but they themselves are not derived from FOSS code.

It can be argued that the weights of a model are derived from the training data, because they contain something from the training data (hard to say what exactly: knowledge, ideas, patterns?)

It can also be argued that output is derived from weights.

If we accept both of those claims, then GPL training data -> GPL weights -> every output is GPL

> If an AI outputs copyrighted code

Again, the issue is not what exactly does AI output, but where it comes from.

eru 2 days ago | parent | prev [-]

It would be relatively easy to scan the output of the LLM for copyrighted material, before handing it to the user.

(I say 'relatively easy'. Not that it would be trivial.)
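
A crude version of such a scanner is just an n-gram index over the protected corpus. A minimal sketch (the function names, the sample corpus, and the choice of 8-word shingles are all illustrative assumptions; a real filter would need fuzzy matching to catch paraphrases and renamed identifiers):

```python
def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_verbatim_overlap(output, corpus, n=8):
    """Flag model output that shares any n consecutive words with a
    protected corpus. Verbatim-only: trivial edits evade it."""
    protected = set()
    for doc in corpus:
        protected |= word_ngrams(doc, n)
    return bool(word_ngrams(output, n) & protected)
```

Verbatim detection like this is the easy part; deciding whether a near-miss is an infringing derivative is exactly the question the thread is arguing about.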

gottheUIblues 2 days ago | parent | prev [-]

If that theory holds, you'd also have to ensure that the models have not been trained on any code that is licensed incompatibly with the GPL, in which case the models could not be distributed at all

throwaway27448 3 days ago | parent | prev | next [-]

Intellectual property never made much sense to begin with. But it certainly makes no sense now, where the common creator has no protections against greedy corporate giants who are happy to wield the full weight of the courts to stifle any competition for longer than we'll be alive.

Or, in the case of LLMs, recklessly swing about software they don't understand while praying to find a business model.

not_paid_by_yt 3 days ago | parent | next [-]

hey, just don't try to copy their LLM by distilling it, cause that's "theft". If we weren't all doomed anyway, this industry would never have been allowed to exist in the first place, but I guess this is just what the last few decades of our civilization will look like.

vova_hn2 2 days ago | parent [-]

> hey just don't try to copy their LLM by distilling it, cause that's "theft"

They can call it whatever they want, but I don't think that it is illegal.

eru 3 days ago | parent | prev | next [-]

> [...] greedy corporate giants who are happy to wield the full weight of the courts to stifle any competition for longer than we'll be alive.

How is any of this new?

As1287 3 days ago | parent | prev [-]

Poor billionaire Rowling has no protections against the evil corporations. Everyone using this argument has no clue about artists and writers.

Yes, corporations take a large cut, but creative people welcomed copyright, made the bargain, and got fame in the process. Which was always better for them than letting Twitch take 70% and being a sharecropper.

Silicon Valley middlemen are far worse than the media and music industry.

graemep 2 days ago | parent | next [-]

The individuals who get rich from copyright are a rarity.

Most mid-list authors make very little from copyright. A lot of the "authors" who make a lot of money from writing are celebs who slap their name on a ghost written work.

> Which was always better for them than let Twitch take 70% and be a sharecropper.

Copyright predates Twitch or giant corporations and was designed to protect the profits of the publishers from the start.

https://en.wikipedia.org/wiki/Statute_of_Anne

lstodd 2 days ago | parent | next [-]

This is so, but note that publishers were the "giant corporations" of the time.

In this nothing changed. Authors never were and still are not the point of copyright/IP.

graemep 2 days ago | parent [-]

Comparatively big for the time, but very small compared to publishing companies now.

oblio 2 days ago | parent | prev [-]

Giant corporations started in the early 1600s and back then they had gunboats :-p

graemep 2 days ago | parent [-]

there were very few of them though. They constitute a far larger proportion of the economy now.

jacquesm 2 days ago | parent | prev [-]

The reason you mention 'poor billionaire Rowling' is most likely because she's the only billionaire author that you know by name. If authors regularly became billionaires you'd have left out that name.

BobbyJo 3 days ago | parent | prev | next [-]

If the rise of DraftKings and Polymarket/Kalshi has taught me anything, it's that the law becomes meaningless at scale. Sad.

advael 3 days ago | parent [-]

Sure, but that's more a result of policy decisions than an inevitable result of some natural law. Corporate lawlessness has been reined in before and it can be again

cogman10 3 days ago | parent | prev | next [-]

If there's a case to be made, it's derivative works. [1]

What makes it all tricky for the courts is that there's no good way to really identify what the generated code is a derivative of (except in maybe some extreme examples).

[1] https://en.wikipedia.org/wiki/Derivative_work

felipeerias 3 days ago | parent [-]

One could carefully calculate exactly how much a given document in the training set has influenced the LLM's weights involved in a particular response.

However, that number would typically be vanishingly small, making it hard to argue that the whole model is a derivative of that one individual document.
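
For what it's worth, there is published work on exactly this kind of attribution (influence functions, TracIn-style scores), where a training example's influence on an output is approximated by the dot product of its loss gradient with the output's loss gradient. A toy numpy sketch on a linear model (the model, the data, and the single-checkpoint simplification are all illustrative assumptions):

```python
import numpy as np

def grad_sq_loss(w, x, y):
    """Gradient of squared loss 0.5*(w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def influence(w, x_train, y_train, x_test, y_test):
    """TracIn-style score at one checkpoint: roughly, how much one SGD
    step on the training example would have reduced the test loss."""
    return grad_sq_loss(w, x_train, y_train) @ grad_sq_loss(w, x_test, y_test)

w = np.array([1.0, 0.0])                         # current model parameters
x_test, y_test = np.array([1.0, 0.0]), 0.0       # output we want to attribute

# A training example identical to the test point scores high...
print(influence(w, np.array([1.0, 0.0]), 0.0, x_test, y_test))  # 1.0
# ...while an orthogonal, already-fit example scores zero.
print(influence(w, np.array([0.0, 1.0]), 0.0, x_test, y_test))  # 0.0
```

Doing this honestly at the scale of billions of parameters and documents is the hard part, and it's partly why the per-document numbers come out so small.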

Nevertheless, a similar approach might work if you took a FOSS project as a whole, e.g. "the model knows a lot about the Linux kernel because it has been trained on its source code".

However, it is still not clear that this would be necessarily unlawful or make the LLM output a derivative work in all cases.

It seems to me that LLMs are trained on large FOSS projects as a way to teach them generalisable development skills, with the side effect of learning a lot about those particular projects.

So if I used an LLM to contribute to the kernel, clearly it would be drawing on information acquired during its training on the kernel's source code. Perhaps it could be argued that the output in that case would be a derivative?

But if I used an LLM to write a completely unrelated piece of software, the kernel training set would be contributing a lot less to the output.

cogman10 2 days ago | parent [-]

> One could carefully calculate exactly how much a given document in the training set has influenced the LLM's weights involved in a particular response.

Not really.

Think of, for example, a movie like "who framed roger rabbit". It had intellectual property from all over. Had the studios not gotten the rights from each or any of those properties, they could have been sued for copyright infringement. It's not really a question of influence.

So yeah, while the LLM might have been trained on the kernel, it was also likely trained on code with commercial licenses. Conversely, because it was trained on code with GPL licenses, that might mean commercial software with LLM contributions would need to inherit the GPL to be legal (and a bunch of other licenses).

It's a big old quagmire and I think lawyers haven't caught up enough with how LLMs work to realize this.

not_paid_by_yt 3 days ago | parent | prev | next [-]

That's always what laws existed for; a law is just a formal way of saying "we will use violence against you if you do something we don't like", and that has always been primarily written by and for the people who already have the power to do that. It's not the worst, certainly better than kings just being able to do as they please.

vova_hn2 2 days ago | parent [-]

> certainly better than Kings just being able to do as they please

That's debatable. In case of a king you always know whom to blame and who has full responsibility. No opportunity to hide behind "well, you voted for this" or "I'm not making the laws, I'm merely enforcing them".

hparadiz 3 days ago | parent | prev [-]

Derivative work.

throwaway27448 3 days ago | parent [-]

Let's cut the rot off at the root rather than pretending like the fruit is going to nourish us.

thfuran 3 days ago | parent [-]

I’m not entirely clear on what you’re suggesting abolishing here: copyright, AI, the companies making the frontier models?

m4rtink 2 days ago | parent [-]

Yes. ;-)