adampunk an hour ago

I don't think this argument is a winner. It fails on a few grounds:

First, unless you can point to regurgitation of memorized code, you're not able to make an argument about distribution or replication. This is part of the problem that most publishers are having with prose text and LLMs. Modern LLMs don't memorize Harry Potter like GPT-3 did. The memorization older models showed came from problems in the training data, e.g. Harry Potter and people writing about Harry Potter are extraordinarily over-represented. It's similar to how with Stable Diffusion you could prompt for anything in the region of "Van Gogh's Starry Night" and get it, since it was in the training data 50-100 different ways. You can't reliably do this with Opus or GPT-5. If they're not redistributing the code verbatim, they're not in violation of the license. One could argue that the models produce "derivative works," but...

The derivative works argument is inapt. The point of it is to disrupt someone's end-run around the license by saying that building on top of GPL code is not enough to non-GPL it. We imagine this will still work for LLMs because of the GPL's virality--I can't enclose a critical GPL module in non-GPL code and not release the GPL code. But the models aren't DOING THAT. They're not reaching for XYZ GPL'd project to build with. They're vibing out a sparsely connected network of information about literally trillions of lines of software. What comes out is a mishmash of code from here and there, and it only coincidentally resembles GPL code, when it does. In order to make this argument work, you need a theory of how LLMs are trained and operate that supports it. Regardless of whether or not one of those theories exists, in court, you'd need to show that your theory was better than the company's expert witness's theory. Good luck.

Second, infringement would need discovery to uncover and would be contingent on user input. This is why the NYT sued for deleted user prompts to ChatGPT--the plaintiffs can't show in public that the content is infringing, so they need to seek discovery to find evidence. That's only going to work in cases where you survive a motion to dismiss--which is EXACTLY where a few of these suits have failed. You need to show first that you can succeed on the merits, then you proceed. That will cut down many of these challenges since they just can't show the actual infringement.

Third, and I think this is the most important, the license protections here are enforced by *copyright*. For copyright it very much matters whether something is lifted verbatim or modified. It is not like patent protection, where things like clean room design have been shown to matter to real courts on real matters. In further contrast to patents, copyright doesn't care if the outcome is close. That's very much a concern for patents. If I patent a gizmo and you produce a gizmo that operates through nearly identical mechanisms to those I patented, then you can be sued--they don't need to be exact. If I write a novel about a boy wizard with glasses who takes a train to a school in Scotland and you write a novel about a boy wizard with glasses who takes a boat to a school in Inishmurray, I can't sue you for copyright infringement. You need to copy the words I wrote and distribute them to rise to a violation.