qarl 5 hours ago

No, if you read the article, the point is in the training, not the reproduction.

That's what all these lawsuits are about - it's the training not the reproduction. I already agreed in my first comment that the reproduction is off limits.

In this case, it appears that Meta torrented illegal copies of the work to do the training. Obviously that's bad. But conflating that with training itself doesn't follow.

SahAssar 4 hours ago | parent | next [-]

The point of these lawsuits is the piracy. My parent comment was about the general situation, not this specific article.

Pirating content is illegal, regardless of whether it is used to train an LLM.

Usage of LLMs trained on unlicensed content (basically all of them) might or might not be illegal.

Using any method to reproduce a copyrighted work by using that original as input in a way that supplants the market value of the original is probably illegal.

At least that is my rudimentary understanding.

qarl 3 hours ago | parent | next [-]

Well - maybe so. But the common belief is that training itself is a violation of copyright, no matter how it's done. That's the argument I'm countering here.

SahAssar 3 hours ago | parent [-]

The issue is that the trainers have not sought licenses for the data and instead outright pirated it.

I don't think anyone considers training a copyright violation if all the training data is licensed. For example, an LLM trained on CC0 content would be fine with basically everyone.

The problem is that training happens on data that is not licensed for that use. Some of that data is also pirated, which makes it even clearer that it is illegal.

qarl 3 hours ago | parent [-]

But why should separate licensing be required at all? A search engine reads and indexes every word of every page it crawls. No one argues that requires licensing, only that the outputs must respect copyright. Why should training be different?

SahAssar 2 hours ago | parent [-]

When Google started outputting summaries, people asked the same questions.

If you supplant the value of the original with the original as input, then you probably have some legal questions to answer.

qarl 26 minutes ago | parent [-]

But that's about the output, not the training. We agree: outputs that supplant the original are the problem. A model constrained to produce only fair use outputs causes no such harm — regardless of what it was trained on.

lobf 3 hours ago | parent | prev [-]

Sharing copyrighted material is illegal. Presumably, if Meta blocked all seeding on the torrents they downloaded, they wouldn't have broken copyright, right?

doublescoop 5 hours ago | parent | prev | next [-]

If copyright law doesn't extend to the works being used for training, why should it extend to the model that is produced as a result? AI model creators have set up an ethical scenario where the right thing to do is ignore copyright laws when it comes to AI, which includes model use. It might never be legal, but it has become ethical to pirate models, distill them against ToS, etc.

qarl 5 hours ago | parent [-]

I'm not sure I follow. Can you say it a different way?

SahAssar 3 hours ago | parent [-]

I think the parent is basically saying: if you can legally pirate a book to train an LLM, why can't you legally pirate the LLM itself?

It's a "rules for thee and not for me" argument.

qarl 3 hours ago | parent [-]

AH. Thank you.

triceratops 4 hours ago | parent | prev [-]

Training requires making copies. Even if Meta had purchased each work they'd have had to make copies of it to distribute around the training cluster.

qarl 4 hours ago | parent [-]

Does it though? If they bought a copy for each machine?

triceratops 3 hours ago | parent [-]

Then no copying happened, so they'd be on firmer legal ground.

qarl 3 hours ago | parent [-]

Good, we're agreed. My only point here is that training is not inherently a copyright violation.