phplovesong 3 hours ago

We need a new license that forbids all training. That is the only way to stop big corporations from doing this.

maxloh 2 hours ago | parent | next [-]

To my understanding, if the material is publicly available or obtained legally (i.e., not pirated), then training a model with it falls under fair use, at least in the US and some other jurisdictions.

If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.

rileymat2 2 hours ago | parent | next [-]

It depends on the license terms. If the only way to obtain the material legally was to agree to a license forbidding training, then using it for that purpose would not be legal.

But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...

justin_murray 2 hours ago | parent | prev | next [-]

This is at least murky, since a lot of pirated material is “publicly available”. Certainly some has ended up in the training data.

michaelmrose 2 hours ago | parent [-]

It isn't, though. You have to break the law to get it. It's about as "publicly available" as your TV would be if I broke into your house and avoided getting shot.

basilgohar 2 hours ago | parent | next [-]

That isn't even remotely a sensible analogy. Equating copyright violation with stealing physical property is an extremely failed metaphor.

MangoToupe 2 hours ago | parent | prev [-]

Maybe you have some legalistic point that escapes me, but I certainly consider my house to be private and the internet to be public.

colechristensen 2 hours ago | parent | prev [-]

I wouldn't say this is settled law, but it looks like this is one of the likely outcomes. It might not be possible to write a license to prevent training.

munchler 2 hours ago | parent | prev | next [-]

By that logic, humans would also be prevented from “training” on (i.e. learning from) such code. Hard to see how this could be a valid license.

psychoslave 2 hours ago | parent | next [-]

Isn’t it the very reason why we need cleanroom software engineering:

https://en.wikipedia.org/wiki/Cleanroom_software_engineering

codedokode 2 hours ago | parent | prev [-]

Bad analogy, probably made up by capitalists to confuse people. ML models cannot and do not learn. "Learning" is just the name of a process in which a model developer downloads pirated material and processes it with an algorithm (computes parameters from it).

Also, humans do not need to read millions of pirated books to learn to talk, and a human artist doesn't need to steal millions of pictures to learn to draw.

1gn15 an hour ago | parent [-]

> And a human artist doesn't need to steal millions of pictures to learn to draw.

They... do? Not just pictures, but also real-life data, which is a lot more data than an average modern ML system gets. An average artist has probably seen (stolen?) millions of pictures from their social media feeds over their lifetime.

Also, you claim to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is opposition to private property, and copyright is private property because it gives you power over others. You must be against copyright, and against the very concept of "stealing pictures", if you are to be an anti-capitalist.

WithinReason 2 hours ago | parent | prev | next [-]

Wouldn't it be still legal to train on the data due to fair use?

gus_massa 2 hours ago | parent [-]

I don't think it's fair use, but everyone on Earth disagrees with me. So even with the standard default license that prohibits absolutely everything, humanity minus one considers it fair use.

justin_murray 2 hours ago | parent | prev [-]

Honest question: why don’t you think it is fair use?

I can see how it pushes the boundary, but I can't lay out the logic for why it's not. The code has been published for the public to see. I'm allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn't they have kept it to themselves?

These agents are just doing a more sophisticated, faster version of that same act.

gus_massa an hour ago | parent | next [-]

Some projects, like Wine, forbid you from contributing if you have ever seen the source of MS Windows [1]. The meatball inside your head is tainted.

I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro, or Excel?). They printed every single screen and made a team write a full specification in English. Later, a separate team looked at the screenshots and the text and reimplemented it. Apparently meatballs can get tainted, but the plain-English-specification loophole was safe enough.

[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...

> Who can't contribute to Wine?

> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.

mixedbit 2 hours ago | parent | prev [-]

Before LLMs, programmers had a pretty good intuition about what the GPL license allowed. It is of course clear that you cannot release a closed-source program with GPL code integrated into it. I think it was also quite clear that you cannot legally incorporate GPL code into such a program by making changes here and there, renaming some stuff, and moving things around. But this is pretty much what LLMs are doing. When humans do it intentionally, it is a violation of the license; when it is automated and done on a huge scale, is it really fair use?

WithinReason an hour ago | parent [-]

> this is pretty much what LLMs are doing

I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?

mixedbit an hour ago | parent [-]

Do you honestly believe there are people on this board who haven't used LLMs? Ridiculing someone you disagree with is a poor way to make an argument.

WithinReason 11 minutes ago | parent [-]

Lots of people on this board are philosophically opposed to them, so it was a reasonable question, especially in light of your description of them.

James_K 2 hours ago | parent | prev | next [-]

Would such a license fall under the definition of free software? Difficult to say. Counter-proposition: a license which permits training if the model is fully open.

amszmidt 2 hours ago | parent | next [-]

It isn't that difficult: a license that restricts how the program may be used is a non-free software license.

"The freedom to run the program as you wish, for any purpose (freedom 0)."

Orygin 2 hours ago | parent | next [-]

Yet the GPL imposes requirements on me, and we consider it free software.

You are still free to train on the licensed work, BUT you must meet the requirements (just like with the GPL), which would include making the model open source / open weight.

helterskelter 2 hours ago | parent | prev [-]

Running the program and analyzing the source code are two different things...?

amszmidt an hour ago | parent [-]

In the context of Free Software, yes. Freedom one is about the right to study a program.

Orygin 2 hours ago | parent | prev | next [-]

My next project will be released under a GPL-like license with exactly this condition added. If you train a model on this code, the model must be open source & open weights

fouronnes3 2 hours ago | parent [-]

Not sure why the FSF or some other organization didn't release a license like this years ago.

amszmidt 2 hours ago | parent [-]

Because it would violate freedom zero. Adding such terms to the GNU GPL would also mean that others could remove them: they would be considered "further restrictions" and can be removed (see section 7 of the GNU GPL version 3).

Orygin 2 hours ago | parent [-]

Freedom 0 is not violated. The GPL includes restrictions on what you can do with the software, yet it's still open source.

You can do whatever you want with the software, BUT you must do a few things. For the GPL, that's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source"?

Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.

amszmidt 2 hours ago | parent [-]

That is not really correct; the GNU GPL doesn't have any terms whatsoever on how you can use or modify the program. You're free to make a GNU GPL program do anything (i.e., use it for any purpose).

I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.

Orygin an hour ago | parent [-]

> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

"A work based on the Program" can be defined to include AI models (just define it; it's your contract). "All of these conditions" can include conveying the AI model under an open source license.

I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) on the AI model that is derivative of the licensed code.

Edit: I know this may not be the best section (the one after, regarding non-source forms, might be better), but in spirit it's exactly the same, imo, as the GPL forcing you to keep the GPL license on the work.

amszmidt an hour ago | parent [-]

I think maybe you're mixing up distributing and running a program, at least taking your initial comment into account: "if you train/run/use a model, it must be open source".

Orygin 39 minutes ago | parent [-]

I should have been more precise: "If you train an AI model on this work and distribute it, the model must use the same license as the work".

Using the AGPL as the base instead of the GPL (under which network access counts as distribution), any user of the software would have the right to the source code of the AI model and its weights.

My goal is not to impose more restrictions on the AI maker, but to guarantee rights to users of software that was trained on my open source code.

tomrod 2 hours ago | parent | prev [-]

Model weights, source, and output.

scotty79 2 hours ago | parent | prev [-]

We need a ruling that LLM generated code enters public domain automatically and can't be covered by any license.

palata 2 hours ago | parent | next [-]

But then we would need a way to prove that some code was LLM generated, right?

Like if I copy-paste GPL-licensed code, the way you realise that I copy-pasted it is that 1) you can see it and 2) the GPL-licensed code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?

michaelmrose 2 hours ago | parent | prev [-]

Laws exist to protect those who make and have money. If trillions could be made harvesting your kids' kidneys, it would be legal.

basilgohar 2 hours ago | parent [-]

It's done extrajudicially in war zones such as Palestine, where hostages have been returned from Israeli jails, dead or alive, with missing organs [0].

[0] https://factually.co/fact-checks/justice/evidence-investigat...