Genuine question: if I train my model with copyleft material, how do you prove I did?

Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.

I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.

bwfan123 9 minutes ago

Sometimes LLMs actually generate copyright headers in their output as well (lol), like in this PR, which was the subject of a recent HN post [1]:

https://github.com/ocaml/ocaml/pull/14369/files#diff-062dbbe...

[1] https://news.ycombinator.com/item?id=46039274

friendzis an hour ago

> Genuine question: if I train my model with copyleft material, how do you prove I did?

An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?

In other words, even if your model was trained strictly on copyleft material, if it outputs a copyrighted work when properly prompted, is that copyright infringement, and if so, by whom?

Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic mouse with round black ears, red shorts and yellow boots". Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?

PaulKeeble an hour ago

It's why I stopped contributing to open source work. It's pretty clear in the age of LLMs that this breach of the license under which it is written will be allowed to continue and that open source code will be turned into commercial products.

ACCount37 24 minutes ago

You need low-level access to the AI in question, and a lot of compute, but for most AI types, you can infer whether a given data fragment was in the training set.

It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.

Now, would that be enough to put the entire AI under GPL? I doubt it.
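
A toy sketch of how such a test can work, assuming a loss-thresholding approach: text the model saw many times tends to get a lower average per-token loss than unseen text. Everything here is a hypothetical stand-in (the character bigram "model", `avg_nll`, `looks_like_member`, the smoothing constant, the way the threshold is calibrated); a real test would read token log-probabilities from the actual model, which is where the low-level access and the compute come in.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character bigrams and how often each character starts one."""
    pairs, contexts = Counter(), Counter()
    for text in corpus:
        for a, b in zip(text, text[1:]):
            pairs[(a, b)] += 1
            contexts[a] += 1
    return pairs, contexts

def avg_nll(model, text, vocab_size=256):
    """Average negative log-likelihood per character (add-one smoothing)."""
    pairs, contexts = model
    steps = list(zip(text, text[1:]))
    if not steps:
        raise ValueError("need at least two characters")
    total = 0.0
    for a, b in steps:
        p = (pairs[(a, b)] + 1) / (contexts[a] + vocab_size)
        total -= math.log(p)
    return total / len(steps)

def looks_like_member(model, fragment, reference_texts):
    """Flag the fragment if its loss is below every known-unseen reference."""
    threshold = min(avg_nll(model, t) for t in reference_texts)
    return avg_nll(model, fragment) < threshold

# Hypothetical training set: one license line repeated many times,
# plus a document that appears only once.
gpl_line = ("you can redistribute it and/or modify it under the terms "
            "of the GNU General Public License")
training_set = [gpl_line] * 200 + ["The quick brown fox jumps over the lazy dog"]
model = train_bigram(training_set)

# Texts we know were never in the training set, used to calibrate the threshold.
unseen = ["Pack my box with five dozen liquor jugs",
          "Sphinx of black quartz, judge my vow"]

print("repeated line loss:", round(avg_nll(model, gpl_line), 2))
print("unseen text losses:", [round(avg_nll(model, t), 2) for t in unseen])
print("flagged as member: ", looks_like_member(model, gpl_line, unseen))
```

The heavily repeated line is the easy case, which matches the point about repeated data. The single-occurrence document barely moves the counts, so a threshold this crude has much less to work with there.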

blibble 17 minutes ago

> Genuine question: if I train my model with copyleft material, how do you prove I did?

discovery via lawyers

freedomben 2 hours ago

I've thought about this as well, especially for the case when it's a company-owned product that is AGPLed. It's a really tough situation, because the last thing we want is for competitors to come in and LLM-wash our code to benefit their own product. I think this is a real risk.

On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM, more power to them. For the applications though, I see that as a violation of the license.

At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee of users' freedom are more important to me than a theoretical threat. The one exception is anything that is truly at risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.

I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.

basilgohar 2 hours ago

Maybe we should require that training data be published or at least referenced.

luqtas an hour ago

genuine question: why are you training your model on content whose requirements will explicitly be violated if you do?

	1gn15 an hour ago

	out of pure spite for hypocritical "hackers"

ForHackernews 31 minutes ago

https://www.penny-arcade.com/comic/2024/01/19/fypm

Anything you produce will be consumed and regurgitated by the machine. It's a personal question for everyone whether you choose to keep providing grist for their mills.

mistrial9 2 hours ago

> Should I keep open sourcing my code now that the licence doesn't matter anymore?

Your LICENSE matters in much the same ways it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but not in all cases at all times. Do not despair!