kabes 3 days ago

Isn't it up to you to prove the model used AGPLv3 code, rather than for them to prove it didn't?

DiabloD3 3 days ago | parent [-]

Not inherently.

If their model reproduces enough of an AGPLv3 codebase near verbatim, and it cannot be simply handwaved away as a phonebook situation, then it is a foregone conclusion that they either ingested the codebase directly, or did so through somebody or something that did (which dooms purely synthetic models, like what Phi does).

I imagine a lot of lawyers are salivating over the chance of bankrupting big tech.

reissbaker 3 days ago | parent [-]

The onus is on you to prove that the code was reproduced and is used by the entity you're claiming violated copyright. Otherwise literally all tools capable of reproduction — printing presses, tape recorders, microphones, cameras, etc — would pose existential copyright risks for everyone who owns one. The tool having the capacity for reproduction doesn't mean you can blindly sue everyone who uses it: you have to show they actually violated copyright law. If the code it generated wasn't a reproduction of the code you have the IP rights for, you don't have a case.

TL;DR: you have not discovered an infinite money glitch in the legal system.

DiabloD3 3 days ago | parent [-]

Yes! All of those things DO pose existential copyright risks if their owners use them to violate copyright! We're both on the same page.

If you have a VHS deck, copy a VHS tape, and start handing out copies, and I pick one up and see that, lo and behold, it contains my copyrighted work, then I have sufficient proof to sue you and most likely win.

If you train an LLM on pirated works and start handing out copies of that LLM, and I pick up a copy and ask it to reproduce my work, and it can do so, even partially, then I have sufficient proof to sue you and most likely win.

Technically, even asking "which license" is a bit moot: AGPLv3 or not, it's a copyright violation to reproduce the work without a license. GPL just makes the problem worse for them: anything involving any flavor of GPLv3 can end up snowballing, with major GPL rightsholders enforcing the GPLv3 curing clause, as they will most likely also be able to convince the LLM to reproduce their works.

The real TL;DR is: they have not discovered an infinite money glitch. They must play by the same rules as everyone else, and they are not warning their users of the risk of using these tools.

BTW, if I were wrong about this (IANAL, after all), then so are the legal departments at companies across the world. Virtually all of them won't allow AGPLv3 programs in the door purely because of the legal risk, and many of them won't allow the use of LLMs given the current state of the legal landscape.

reissbaker 18 hours ago | parent | next [-]

No. You don't have sufficient proof to sue me simply for using an LLM, unless I actually use it to reproduce your work. If I don't use it to actually reproduce your work, you lose. And the onus is on you to prove that I did. Your claim was:

> There is no reason why I can't sue every single developer to ever use an LLM and publish and/or distribute that code.

Simply proving that it's possible to reproduce your work with an LLM doesn't prove that I did, in fact, reproduce your work with an LLM. Just like you can't sue me for owning a VHS — even though it's possible that I could reproduce your work with one. The onus is on you to show that the person using the LLM has actually used it to violate your copyrighted work.

And running around blindly filing lawsuits claiming someone violated your copyright, with no proof other than "they used an LLM to write their code!", will get your case thrown out immediately, and if you do it enough you'll likely get your lawyer disbarred (not that they'd agree to do it; there's no value in it for them, since you'll constantly lose). Just like blindly suing anyone who owns a VHS doesn't work. You have not discovered an infinite money glitch, or an infinite lawsuit glitch.

If you think you have, go talk to a lawyer. It's infinite free money, after all.

DiabloD3 7 hours ago | parent [-]

Again, I shall correct the strawmanning here: if you, the user, reproduce the work, then I can sue you for distributing the reproduced work. If you produce a tool/service whose only purpose is to reproduce works illegally, then I can sue you for making and distributing that tool, and the government may force you to cease production of the tool/service.

The onus would be on the toolmaker/service provider to prove there are legal uses of that tool/service and that their tool/service should not be destroyed. This is established case law, people have lost those cases, and the law is heavily tilted in favor of copyright holders.

The majority of LLMs are trained on pirated works. The companies are not disclosing this (as they would be immediately sued if they did), and are letting their users twist in the wind. Again, if those users use the LLM to reproduce a copyrighted work, all involved parties can be sued.

See the 1984 Betamax case (Sony Corp. of America v. Universal City Studios) for how the case law works: Sony was able to prove there are legitimate and legal uses for being able to record things, and thus could keep producing Betamax products and could not be sued for pirates pirating with them...

... but none of the LLM distributors or inference service providers have reached that bar (and may not even be able to). Nor did the ruling make it legal to pirate things with Betamax: those people were still sued, and sometimes even imprisoned. Similarly, a Sony-style ruling would not free LLM users to continue pirating works with LLMs; it would only prevent OpenAI, Anthropic, etc., from being shut down.

If you still think this is an infinite money glitch, then it is exactly as you say, and this glitch has been used against the American people by the rich for our entire lives.

reissbaker 6 hours ago | parent [-]

You are just making things up. In the American court system you are innocent until proven guilty. There's no "established case law" that tool makers have to prove their tools can be used for whatever or else they're guilty — you have to prove they're guilty if you think they are. You don't even understand the cases you're citing! Sony was presumed innocent and the onus was on the plaintiffs, who failed. And you couldn't sue someone for simply owning a VCR or using one — notably, the plaintiffs were trying to sue Sony, the VCR maker, not everyone in America who owned a VCR.

In an even greater misunderstanding of the American legal system, you're using the Sony case to argue that you would win court cases against LLM users. The plaintiffs in the Sony case lost! This makes your pretend case even harder: the established precedent is in fact the opposite of what you want to do, which is randomly sue everyone who uses LLMs based on a shaky analysis that since it's possible to use them to infringe, everyone is guilty of infringement until proven innocent.

Moreover, at this point you're heavily resorting to motte and bailey, where you originally claimed you could sue anyone who used LLMs, and are now trying to back up and reduce that claim to just being able to sue OpenAI, Anthropic, and training companies.

Continuing this discussion feels pointless. Your claim was wrong. You can't blindly sue anyone who uses LLMs. If you think you can, go talk to a lawyer, since you seem to believe you've found a cheat code for money.

DiabloD3 4 hours ago | parent [-]

> You can't blindly sue anyone who uses LLMs.

Correct, and that claim is a strawman frequently used on HN.

> In an even greater misunderstanding of the American legal system, you're using the Sony case to argue that you would win court cases against LLM users.

Not at all. I said that this is the only actual path for the companies to survive, if they can thread that legal needle. The users do not get the benefit of it. The FBI spent the better part of three decades busting small-time pirates who reproduced VHS tapes using perfectly legal (per the case I cited) tape decks.

Notice that not everybody has won this challenge; the Sony case merely shows how high you have to jump. Many companies have been found liable for producing a tool or service whose primary use is to commit crimes or other illegal acts.

Companies that literally bent over backwards to comply with the law still got absolutely screwed. See what happened to Megaupload: all they did was provide an encrypted offsite file storage system, and they complied with all applicable laws promptly and without challenge.

Absolutely nothing stops the AI companies from being railroaded like that. However, I believe they will attempt to win a Sony-like ruling to save their bacon, while throwing their users under the bus.

> the established precedent is in fact the opposite of what you want to do,

Nope, just want to sue the code pirates. Everyone else can go enjoy their original AI slop as long as it comes from a 100% legally trained model and everybody keeps their hands clean.

> and are now trying to back up and reduce that claim

No, I literally just gave the Sony case as an example of reducing the claim in the other direction. The companies may in fact find a way to weasel out of this, but the users never will.

Another counter-example, by the way, not that you asked for one, is Napster. Napster was ordered by a court to shut down their service, as its primary use was to facilitate piracy. While it is most likely that OpenAI et al. will try to Sony their way out, they could end up like Napster instead, or worse, end up like Megaupload.

> everyone is guilty of infringement until proven innocent.

Although you say this as if it were absurd, it is largely how copyright cases work in the US in practice, even though, in theory, it should be innocent until proven guilty. That exact phrase, however, is only meaningful in criminal cases. Things are much looser in civil cases, and the bar for winning a civil case is much lower.

In a copyright case, the copyright owner is usually the plaintiff (although not always!), and copyright-owner plaintiffs usually win these cases, even in cases where they really shouldn't have.

> Continuing this discussion feels pointless.

Yes, it really does. Many people on HN clearly think it is okay to copyright-wash through LLMs, and that the output of LLMs is magically free of infringement by some unexplained handwaving.

You still have not explained how a user can have an LLM reproduce a copyrighted work, and then distribute it, and somehow the copyright owners cannot sue everyone involved, which is standard practice in such cases.

reissbaker 11 minutes ago | parent [-]

> as long as it comes from a 100% legally trained model

This is where your entire argument falls apart. You can't sue people just for using a tool that has the capability to violate copyright: you actually have to prove they did so. While it's technically true that you don't need to meet the bar of "proof" for civil cases, you're still not in luck: the bar is "a preponderance of the evidence," which you don't have if you're just blindly suing people for using an LLM, with zero actual evidence of infringement. Using an LLM isn't illegal, so evidence that they used an LLM isn't evidence of anything that matters to your case: aka, you have nothing.

All of your other examples similarly fall apart. In the Napster cases, the RIAA had to show that people actually violated copyright, not merely that they had Napster installed or used it for non-copyrighted works. And again, you're trying to motte-and-bailey your way out of your original claim that you could blindly sue LLM users, as opposed to the training companies who make the models. You couldn't sue Megaupload users who used Megaupload for random file storage; you could only sue Megaupload itself for knowingly not complying with copyright law.

You really just don't understand the legal system. I'm not going to respond to this thread anymore. If you think you have a free money cheat code, go ahead and try to use it — you'll fail.

Workaccount2 3 days ago | parent | prev [-]

I think you are confused about how LLMs train and store information. These models aren't archives of code and text; they are surprisingly small, especially relative to the training dataset.
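To put rough numbers on that (the figures below are illustrative assumptions, not any particular model's specs), a back-of-the-envelope comparison:

    # Back-of-the-envelope comparison of model size vs. training data size.
    # All figures are illustrative assumptions, not any specific model's.
    params = 70e9                    # a 70B-parameter model
    bytes_per_param = 2              # fp16/bf16 weights
    model_gb = params * bytes_per_param / 1e9

    tokens = 15e12                   # on the order of 15T training tokens
    bytes_per_token = 4              # rough average for text
    data_gb = tokens * bytes_per_token / 1e9

    print(f"weights: ~{model_gb:,.0f} GB; training data: ~{data_gb / 1000:,.0f} TB")
    print(f"the data is ~{data_gb / model_gb:,.0f}x larger than the weights")

At ratios like that, the weights simply cannot store the training set byte-for-byte; at best they can memorize a small fraction of it.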

A recent decision in an Anthropic lawsuit also reaffirms that training on copyrighted material is not itself a violation of copyright. [1]

However, outputting copyrighted material would still be a violation, the same as a person doing it.

Most artists can draw a Batman symbol. Copyright means they can't monetize that ability. It doesn't mean they can't look at bat symbols.

[1]https://www.npr.org/2025/06/25/nx-s1-5445242/federal-rules-i...

DiabloD3 3 days ago | parent [-]

No, I'm quite aware of how LLMs work. They are statistical models. They have, however, already been caught reproducing source material accurately. There is inherently no way to stop that when the only training data for a given output is a limited set of inputs: LLMs can and do exhibit extreme overfitting.
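For the skeptical, this kind of overfitting is straightforward to probe. A minimal sketch, where `generate` is a placeholder for whatever model inference API is being tested, not a real library call:

    # Sketch of a memorization probe: prompt the model with the opening of
    # a known copyrighted file and measure how much of the remainder it
    # reproduces verbatim. `generate` is a stand-in for the model under test.
    def memorized_fraction(original: str, generate, prefix_chars: int = 500) -> float:
        prefix = original[:prefix_chars]
        rest = original[prefix_chars:]
        completion = generate(prefix, max_chars=len(rest))
        # Length of the leading run of characters matching the source exactly.
        matched = 0
        for got, expected in zip(completion, rest):
            if got != expected:
                break
            matched += 1
        return matched / len(rest) if rest else 0.0

A high fraction on text the model was shown only the prefix of is hard to explain by anything other than the full text being in the training set.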

As for the Anthropic lawsuit, the piracy part of the case is continuing. Most models are built from pirated or unlicensed inputs. The part that was decided, although the decision imo was wrong, only covers whether someone CAN train a model.

At no point have I claimed you can't train one. The question is whether you can distribute one, and then use one. An LLM is not simplistic enough to be considered a phonebook, so they can't just handwave that away.

Saying an LLM can do that is like saying an artist can make a JPEG of a Batman symbol, and that it's totally okay for them to distribute it because the JPEG artifacts are transformative. LLMs are ultimately just a clever way of compressing data, and compressors are not transformative under the law; but possessing a compressor is not inherently illegal, nor is using one on copyrighted material for your own personal use.
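To make the compression framing concrete, a toy sketch (the unigram "model" here is a stand-in assumption; real LLMs are far better next-token predictors): under arithmetic coding, a symbol the model assigns probability p costs about -log2(p) bits, so a model fit tightly to a text can encode, and regenerate, that text cheaply.

    import math

    # Cost in bits to encode a text under a model, as an arithmetic coder
    # would: each character costs -log2(model probability of that character).
    def bits_to_encode(text, prob):
        return sum(-math.log2(prob(ch)) for ch in text)

    text = "int main(void) { return 0; }"
    uniform = lambda ch: 1 / 128                 # knows nothing: 7 bits/char
    counts = {ch: text.count(ch) for ch in set(text)}
    overfit = lambda ch: counts[ch] / len(text)  # fit to this exact text

    print(f"uniform model: {bits_to_encode(text, uniform):.0f} bits")
    print(f"overfit model: {bits_to_encode(text, overfit):.0f} bits")

The better a model predicts a specific work, the fewer bits it needs to reproduce it, which is exactly the memorization problem.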

Workaccount2 2 days ago | parent [-]

They will just put a dumb copyright filter on the output, a la YouTube or other hosting services.
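In its simplest form, that's just n-gram overlap between the output and an index of protected works. A minimal sketch, assuming you can enumerate the protected corpus (real systems like Content ID are far more involved):

    # Minimal sketch of an output-side copyright filter: flag generations
    # that share long word n-grams with an index of protected works.
    def build_index(corpus: list[str], n: int = 12) -> set[tuple[str, ...]]:
        index = set()
        for doc in corpus:
            tokens = doc.split()
            for i in range(len(tokens) - n + 1):
                index.add(tuple(tokens[i:i + n]))
        return index

    def looks_infringing(output: str, index: set, n: int = 12) -> bool:
        tokens = output.split()
        return any(tuple(tokens[i:i + n]) in index
                   for i in range(len(tokens) - n + 1))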

Again, it's illegal for artists to recreate copyrighted works; it's not illegal for them to see them or know them. It's not like you can't hire a guy because he can perfectly visualize Pikachu in his head.

The conflation of training on copyrighted material with distribution of copyrighted material is so disingenuous, and thankfully the courts so far recognize that.

DiabloD3 2 days ago | parent [-]

YouTube et al.'s copyright detection is mostly nonfunctional. It can only match exactly the same input, with very little leeway. Even resizing to the wrong aspect ratio, or changing the audio sampling rate too far, fucks up the detection.
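The brittleness is easy to demonstrate with naive fingerprinting: hash fixed-position chunks of the raw bytes, and any global transformation, here simulated by a one-byte shift, stops every chunk from matching (a toy illustration, not how Content ID actually fingerprints media):

    import hashlib, os

    # Toy demo of brittle fingerprinting: hash fixed-position chunks, then
    # shift the byte stream by one (a stand-in for resampling/re-encoding)
    # and none of the fingerprints match anymore.
    def fingerprints(data: bytes, chunk: int = 1024) -> set[str]:
        return {hashlib.sha256(data[i:i + chunk]).hexdigest()
                for i in range(0, len(data), chunk)}

    original = os.urandom(16 * 1024)    # stand-in "media" bytes
    shifted = b"\x00" + original        # a one-byte global offset

    common = fingerprints(original) & fingerprints(shifted)
    print(f"chunks still matching after a 1-byte shift: {len(common)}")  # 0

Robust matching requires transformation-invariant perceptual hashes, and those trade precision for recall, which is the leeway problem described above.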

It's illegal for artists to distribute recreated copyrighted work in a way that is not transformative. It isn't illegal to produce it and keep it to themselves.

People also distribute models; they don't merely offer them as a service. And if someone asks a model to produce a copyright violation, and it does so, then the person who created and distributed the model (it's the distribution that is the problem), the service that ran it (assuming it isn't local inference), and the person who asked for the violation to be created can all be looped into the legal case.

This has happened before, long before AI. Companies that 100% participated in the copyright regime, quickly performed takedowns, and ran copyright detection to the best of their ability were still sued, and lost, because their users committed copyright violations using their services, even though the companies did everything right and absolutely above board.

The law is stacked against service providers on the Internet, as it essentially requires them to be omniscient and omnipotent. Such requirements are not levied against service providers in other industries.