Remix.run Logo
Xirdus 44 minutes ago

> Both would resolve to the same question, no?

Yes but my answer would be different. It can be either about what coding agents do (and you'll see that it breaks the social contract), or it can be about what the FOSS social contract is (and you'll argue that coding agents don't break it.) Lo and behold, it was the latter.

> There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code.

Not any work. But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The social contract of the open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintance, as well as give each other the assurance that the software will remain available to them and any future users.

AI gives the user all the benefits of using open source software with none of the obligations that come from using open source software. Developer gains nothing from going open source. It makes no sense for any developer to go open source. Social comtract breaks down, and it's all because AI users didn't hold up their half of the bargain.

Gormo 4 minutes ago | parent [-]

> But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.

I don't disagree with the premise that any LLM that is cloning code wholesale from a third-party repo is creating a derivative work, and the license terms apply to it.

But I also don't agree that non-AI code generators such as transpilers are in the same category as LLMs -- a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.

> There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?

The model isn't a lossy compression archive that merely represents a collection of pre-existing works in parallel to each other. It's a probability matrix that relates together uniquely isolatable units of data to each other across the entire collection.

If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.

> The social contract of the open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintance, as well as give each other the assurance that the software will remain available to them and any future users.

I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed to back to the projects whose software they use.

There's always been both a social and legal obligation to properly attribute authors and abide by license terms when redistributing or forking FOSS code, but neither obligation has ever applied when learning programming techniques from FOSS code in order to write your own software. And the way LLMs are designed to work is more similar to the latter than to the former.

But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.