anamexis 7 months ago

I read the headline as the copyright violation claim being core to the lawsuit.

H8crilA 7 months ago | parent [-]

The plaintiffs focused on exactly this part, the removal of metadata, probably because it's the most likely to hold up in court. One judge remarked on it pretty explicitly, saying that it's just a proxy for the real issue: the use of copyrighted material in model training.
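For concreteness, a toy sketch of what "removal of metadata" means here: stripping copyright-management information (e.g. a notice line) from scraped text before storing it. This is purely illustrative, not a claim about how any actual training pipeline works; the heuristic and sample text are invented for the example.

```python
import re

# Toy heuristic for copyright-management information (CMI):
# any line containing a copyright notice. Illustrative only.
CMI_PATTERN = re.compile(r"(?i)copyright|©|\(c\)\s*\d{4}")

def strip_cmi(text: str) -> str:
    """Drop lines that look like copyright notices from scraped text."""
    return "\n".join(
        line for line in text.splitlines()
        if not CMI_PATTERN.search(line)
    )

doc = "Title: Example Story\nCopyright 2023 Jane Doe\nOnce upon a time..."
print(strip_cmi(doc))  # the notice line is gone; title and body survive
```

The plaintiffs' theory is that doing something like this at scale, knowingly, is itself a violation, independent of whether the training that follows infringes.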

I.e., it's some legalese trick, but "everyone knows" what's really at stake.

0xcde4c3db 7 months ago | parent [-]

Yeah; I think that's essentially where the disconnect is rooted for me. It seems to me (a non-lawyer, to be clear) that it's damn hard to make the case for model training necessarily being meat-and-potatoes "infringement" as things are defined in Title 17 Chapter 1. I see it as firmly in the grey area between "a mere change of physical medium or deterministic mathematical transformation clearly isn't a defense against infringement on its own" and "giant toke come on, man, Terry Brooks was obviously just ripping off Tolkien". There might be a tension between what constitutes "substantial similarity" through analog and digital lenses, especially as the question pertains to those who actually distribute weights.

kyledrake 7 months ago | parent [-]

I think you're at the heart of it; you've humorously framed the grey area here, and it's very weird. Sans a ruling that, for example, computers are too deterministic to be creative, copyright law really does seem to imply that LLM training is legal. Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here? A ruling declaring this copyright infringement is likely to have crazy ripple effects going way beyond LLMs, something a good judge is going to be very mindful of.

Ultimately, this is probably going to require Congress to pass new laws codifying it.

dragonwriter 7 months ago | parent | next [-]

> Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here?

The legal argument is that copying, or creating what would otherwise be derivative works, solely within a human brain is exempt because the human brain is not a medium in which a configuration of information constitutes either a copy or a new work until it is set in another medium or performed publicly, whereas the storage of an artificial computer absolutely is such a medium (both of which are well-established law). So the "learning" metaphor is not legally valid, even if it is arguably a decent metaphor for some other purpose. Furthermore, learning and then creating something new is often illegal, if the "something new" has sufficient proximity to the source material (that's the prohibition on unlicensed derivative works). GenAI systems often do that, and are (so the argument goes) sufficiently frequently used, and known to the service and model providers to be used, intentionally to do that, that even were the training itself not a violation, the standards for contributory infringement are met in the provision of certain models and/or services.

mikae1 7 months ago | parent | prev [-]

According to US law, is the Internet Archive a library? I know they received a DMCA exemption.

If so, you could argue that your local library returns perfect copies of copyrighted works too. IMO it's somehow different from a business turning the results of their scraping into a profit machine.

kyledrake 7 months ago | parent [-]

My understanding is that there is no concept of a library license; you just say you're a library and thereby become one, and whether your claim survives is more a product of social and cultural acceptance than actual legal structures. But someone is welcome to correct me.

The Internet Archive also scrapes the web for content and does not pay authors; the difference is that it spits out literal copies of the content it scraped, whereas an LLM fundamentally attempts to derive a new thing from the knowledge it obtains.

I just can't figure out how to plug this into copyright law. It feels like a new thing.

quectophoton 7 months ago | parent [-]

Also, Google Translate, when used to translate web pages:

> does not pay authors

Check.

> it spits out literal copies of the content it scraped

Check.

> attempts to derive a new thing from the knowledge it obtains.

Check.

* Is interactive: Check.

* Can output text that sounds syntactically and grammatically correct, but a human can instantly say "that doesn't look right": Check.

* Changing one word in a sentence affects words in a completely different sentence, because that changed the context: Check.

mikae1 7 months ago | parent [-]

Scraping with the intent to capitalize: no check.

quectophoton 7 months ago | parent [-]

Even ignoring the fact that programmatic access to translation seems to require payment, or that its parent company is doing the scraping (similar to how one would use CommonCrawl instead of doing the scraping themselves), I am actually in favor of taking into account the intent behind it.

"Give and take", "equal exchange", however people want to put it. I don't mind if someone uses publicly-accessible content and ignores its copyright to make another thing, as long as their result is publicly-accessible and they're prepared to have their copyright ignored in return. If you not only use the result of someone else, but also their process, then be prepared to have your process publicly-accessible too, with its copyright ignored. And so on.

That's why I don't mind "unofficial" translations or subtitles (both copyright violations as soon as they are distributed) appearing on multiple sites. That's why I respect open-source licenses of projects that respect them. That's why I pay for some open-source software even if I don't have to. That's why I give credit to artists even when I use an image that I didn't make myself as profile picture (either from the internet or because I paid for it).

That's also why I don't mind anyone ignoring my copyright as long as it's on "equal" terms: if you vendor my code and pass it off as yours, that's tacit approval for someone else doing the same thing to you ("someone else" because, at least for code, it won't be me).

I only gave very specific examples, but I hope I was able to explain what I mean.

The thing I don't like is the highly asymmetrical situation we're in with generative AI: the result (the trained model) is not publicly accessible the way a significant part of the content it was trained on is; they only release a very limited interface to it.

kyledrake 6 months ago | parent [-]

> Even ignoring the fact that programmatic access to translation seems to require payment, or that its parent company is doing the scraping (similar to how one would use CommonCrawl instead of doing the scraping themselves), I am actually in favor of taking into account the intent behind it.

Does intent matter for the purposes of interpreting the laws here? I'm not criticizing your point, I'm genuinely curious if that matters (outside the context of fair use). I can certainly think of valid use cases that would not be considered fair use.

> The thing I don't like is the highly asymmetrical situation we're in with generative AI: the result (the trained model) is not publicly accessible the way a significant part of the content it was trained on is; they only release a very limited interface to it.

I'm not sure I agree with this one, given that most serious LLMs are free or very low cost to use, and in Llama and Phi-3's case pretty much just given away. That's not a small gesture, given the substantial expense of providing free access to some of these models.