antirez 5 hours ago

I believe that Pilgrim here does not understand very well how copyright works:

> Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code

This is simply not true. The reason the "clean room" concept exists is precisely that the law recognizes that independent implementations ARE possible. The "clean room" process is a trick to make litigation simpler; it is NOT required that you were never exposed to the original code. For instance, Linux was implemented even though Linus and the other devs were well aware of Unix internals. What the law actually asks is: does the new code copy something that was in the original one? The clean room trick just makes it easy to argue that copying was impossible, so any similarities must be accidental. But it is NOT a requirement.

dragonwriter 2 hours ago | parent | next [-]

Neither does the maintainer, who claims a mechanical test of structural similarities can prove anything either way with regard to whether it is legally a derivative work (or even a mechanical copy without the requisite new creative work to be a derivative work).

And then Pilgrim is again wrong in saying that the use of Claude definitively makes it a derivative work because of the inability to prove that the work in question did not influence the neurons involved.

It is all dueling lay misreadings of copyright law, but it is also an area where the actual applicable law, at any level specific enough to cleanly apply, isn't all that clear.

simiones 38 minutes ago | parent | prev | next [-]

I think this is a bit too broad. There are actually three possible cases.

When there is similar code, the only possible defense to show that you did not copy the original is to demonstrate that your process was a clean-room re-implementation.

If the code is completely different, then clean room or not is indeed irrelevant. The only way the author can claim you violated their copyright despite no apparent similarity is to prove you followed some kind of mechanical process for generating the new code from the old one, such as feeding the old code to an LLM as part of the prompt (TBD, completely unsettled: what if the old code is in the training set, but was not part of the prompt?). The burden of proof is on them to show that the dissimilarity is only apparent.

In realistic cases, you will have a mix of similar and dissimilar portions, plus portions where the similarity is questionable. Each of these will need to be analyzed separately, and it's very likely that all the similar portions will need to be rewritten if you can't prove that they were not copied, directly or from memory, from the original, even if they represent a very small part of the work overall. Even if you wrote a 10,000-page book, if you copied one whole page verbatim from another book, you are liable for that page, and the author may force you to take it out.

Someone 23 minutes ago | parent [-]

> When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.

Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:

- fair use

- claiming the copied part doesn’t require creativity

- arguing that the copied code was written by AI (there’s jurisdiction that says AI-generated art can’t be copyrighted (https://www.theverge.com/2023/8/19/23838458/ai-generated-art...). It’s not impossible judges will make similar judgments for AI-generated programs)

pseudalopex 3 minutes ago | parent [-]

> there’s jurisdiction that says AI-generated art can’t be copyrighted

The headline was misleading. The courts said what Thaler could have copyrighted was a complicated question they ignored because he said he was not the author.

red_admiral 2 hours ago | parent | prev | next [-]

I'm with you here, but I see another problem.

The expected functionality of chardet (detecting the character encoding) is essentially fixed - apart from edge cases and new additions to Unicode, you'd expect the original and new implementations to largely pass the same tests and share a lot of similar code, such as the check for "does this start with a BOM".

The fact that JPlag shows such a low % overlap for two implementations of "the same interface" is convincing evidence to me that it's not just plagiarised.
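To illustrate why some overlap is unavoidable: the BOM check mentioned above can really only be written one way. A minimal sketch (hypothetical, not chardet's actual code; the names and table are my own):

```python
# Hypothetical BOM sniffer. Any encoding detector will contain
# something very close to this, so similarity here proves little.
BOMS = {
    b"\xff\xfe\x00\x00": "utf-32-le",
    b"\x00\x00\xfe\xff": "utf-32-be",
    b"\xef\xbb\xbf": "utf-8-sig",
    b"\xff\xfe": "utf-16-le",
    b"\xfe\xff": "utf-16-be",
}

def detect_bom(data: bytes):
    # Check longer BOMs first so UTF-32 isn't mistaken for UTF-16
    # (the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes).
    for bom, encoding in sorted(BOMS.items(), key=lambda kv: -len(kv[0])):
        if data.startswith(bom):
            return encoding
    return None  # no BOM; fall through to statistical detection
```

Two independent authors writing this from the Unicode standard alone would produce near-identical code, which is exactly the kind of overlap JPlag-style tools have to discount.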

jacquesm 5 hours ago | parent | prev | next [-]

This is correct. I think any author of a significant chunk of code that they claim ownership of (which is probably all of us!) should at least study the basics of copyright law. Getting little details wrong can cost you time, money, and eventually your business if you're not careful.

cubefox 2 hours ago | parent | prev | next [-]

If you let an LLM merely rephrase the codebase, that's like letting it rephrase the Harry Potter novels. Which, I'm pretty sure, would still be considered a copy under copyright law, not an original work, despite not copying any text verbatim.

TZubiri 2 hours ago | parent | prev | next [-]

Ok sure, in the alternative, here's the argument:

The AI was trained with the code, so the complete rewrite is tainted and not a clean room. I can't believe this would need spelling out.

pocksuppet an hour ago | parent | next [-]

"Tainted rewrite" isn't a legal concept either. You have to prove (on balance of probabilities - more likely than not) that the defendant made an unauthorized copy, made an unauthorized derivative work, etc. Clean-room rewriting is a defense strategy, because if the programmer never saw the original work, they couldn't possibly have made a derivative. But even without that, you still have to prove they did. It's not an offence to just not be able to prove you didn't break the law.

Manuel_D an hour ago | parent | prev | next [-]

As others pointed out, the point of a "clean room" rewrite is to make a particularly strong case for non-infringement. It doesn't mean that anything other than a clean-room implementation is an infringement.

jdauriemma 2 hours ago | parent | prev | next [-]

This is interesting and I'm not sure what to make of it. Devil's advocate: the person operating the AI also was "trained with the code," is that materially different from them writing it by hand vs. assisted by an LLM? Honestly asking, I hadn't considered this angle before.

cardanome an hour ago | parent [-]

If you worked at Microsoft and had access to the Windows source code you probably should not be contributing to WINE or similar projects as there would be legal risk.

So for this case, not much different legally. Of course there is the practical difference just like there is between me seeing you with my own eyes and me taking a picture of you.

"Training" an LLM is not the same as training a human being. It's a metaphor. It's like confusing the save icon with an actual floppy disk.

I can say I "trained" my printer to print copyrighted material by feeding it bits, but that would be pure sophism.

The problem is that the law hasn't really caught up with our brave new AI future yet, so lots of decisions are up in the air. Plus, governments are incentivized to look the other way on copyright abuses when it comes to AI, since they think having competitive AI is of strategic importance.

senko 2 hours ago | parent | prev [-]

Reread the parent: clean room is not required.

spwa4 an hour ago | parent | prev | next [-]

Given that LLMs were trained on the repository directly, it's not just that anything produced by the LLM is a derivative work; the LLM ITSELF is a derivative work. After all, they are all substantially based on GPL-licensed works by others. The standard courts have always used for "substantially based", by the way, is the ability to extract from the new work anything bigger than an excerpt of the original work.

So convincing evidence, by historical standards, that ChatGPT, Gemini, Copilot AND Claude are all derivative works of the GPL-licensed Linux kernel can be obtained simply by asking "give me struct sk_buff", then continuing to ask until you're through the headers (say, by asking how a network driver uses it).

That means if courts were honest (and they never are when it comes to the GPL), OpenAI, Google, and Anthropic would be forced to release ALL materials needed to duplicate their models "at cost". Given how LLMs work, that would include all models, code, AND training data. After all, that is the contract these companies entered into when using the GPL-licensed Linux kernel.

But of course, to courts copyright applies to you when Microsoft demands it ($30,000 per violation PLUS stopping use of the offending file/torrent/software/... because such measures are apparently justified for downloading a $50 piece of software), but it does not apply to big companies when the rules would destroy them.

The last time this was talked about someone pointed out that Microsoft "stole", as they call it, the software to do product keys. They were convicted for doing that, and the judge even increased damages because of Microsoft's behavior in the case.

But there is no way in hell you'll ever get justice from the courts in this. In fact courts have already decided that AI training is fair use on 2 conditions:

1) that the companies acquired the material itself without violating copyright. Of course it has already been proven that this is not the case for any of them (they scraped it without permission, which has been declared illegal again and again in the file sharing trials)

2) that the models refuse to reproduce copyrighted works. Now go to your favorite model and ask "Give me some code written by Linus Torvalds": not a peep about copyright violation.

... but it does not matter, and it won't matter. Courts are making excuses to let LLMs violate any copyright; the excuses don't hold up and don't convince rational people, but it just doesn't matter.

But of course, you might think that since they bend the law to make what they're already doing legal, they'll do the same for you and help you violate copyright, right? After all, that's how they work! OK, now go and ask:

"Make me an image of Mickey Mouse peeling a cheese banana under an angry moon"

And you'll get a reply "YOU EVIL COPYRIGHT VILLAIN". Despite, of course, Mickey Mouse no longer being covered under copyright!

And to really get angry, find your favorite indie artist and ask it to make something based on their work. Even "Make an MC Escher style painting of Sonic the Hedgehog" ... even that doesn't count as copyright violation; only the truly gigantic companies get copyright protection.
