Interesting. Two key quotes:

> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.

> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.

Tens of thousands of violations at $2500 each would amount to tens of millions of dollars in damages. I am not familiar with this field, does anyone have a sense of whether the total cost of retraining (without these alleged DMCA violations) might compare to these damages?

▲

Xelynega a year ago | parent [-]

If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA protected content from your training data instead of just the one you were most recently sued for(especially if it sets precedent)

▲

sandworm101 a year ago | parent | next [-]

But all content is DMCA protected. Avoiding copyrighted content means not having content as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.

The apparant loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

▲

Xelynega a year ago | parent | next [-]

Yes, that's how every other industry that redistributes content works.

You have to license content you want to use, you cant just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...

▲

YetAnotherNick a year ago | parent [-]

In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case bot is bound by ToS. The biggest and most clear is [1]. People have been scraping internet for as long as internet existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

	▲	account42 a year ago \| parent [-]
		Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.

▲

noitpmeder a year ago | parent | prev [-]

It's insane to me that people don't agree that you need to require a license to train your proprietary for-profit model on someone else's work.

▲

jsheard a year ago | parent | prev | next [-]

It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models performance to the point that it would jeopardize their viability as a company.

▲

Xelynega a year ago | parent | next [-]

I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.

▲

IshKebab a year ago | parent | next [-]

It's not definitely illegal yet.

	▲	yyuugg a year ago \| parent [-]
		It's also definitely not not illegal either. Case law is very much tbd.

▲

CuriouslyC a year ago | parent | prev [-]

[flagged]

▲

Xelynega a year ago | parent [-]

If we really want to be technical, in common law systems anything is legal as long as the highest court to challenge it decides it's legal.

I guess I should have used the phrase "common sense stealing in any other context" to be more precise?

▲

krisoft a year ago | parent [-]

> I guess I should have used the phrase "common sense stealing in any other context" to be more precise?

Clearly not common sense stealing. The Intercept was not deprived of their content. If OpenAI would have sneaked into their office and server farm and took all the hard drives and paper copies with the content that would be "common sense stealing".

▲

TheOtherHobbes a year ago | parent [-]

Very much common sense copyright violation though.

It's that simple. There is no "Yes but you still have your book" argument, because copyright is a claim on commercial value, not a claim on instantiation.

There's some minimal wiggle room for fair use, but clearly making an electronic copy and creating a condensed electronic version of the content - no matter how abstracted - and using it for profit is not fair use.

▲

chii a year ago | parent [-]

> Copyright means you're not allowed to copy something without permission.

but is training an AI copying? And if so, why isn't someone learning from said work not considered copying in their brain?

▲

throw646577 a year ago | parent | next [-]

> but is training an AI copying?

If the AI produces chunks of training set nearly verbatim when prompted, it looks like copying.

> And if so, why isn't someone learning from said work not considered copying in their brain?

Well, their brain, while learning, is not someone's published work product, for one thing. This should be obvious.

But their brain can violate copyright by producing work as the output of that learning, and be guilty of plagiarism, etc. If I memorise a passage of your copyrighted book when I am a child, and then write it in my book when I am an adult, I've infringed.

The fact that most jurisdictions don't consider the work of an AI to be copyrightable does not mean it cannot ever be infringing.

▲

CuriouslyC a year ago | parent | next [-]

The output of a model can be copyright violation. In fact, even if the model was never trained on copyright content, if I provided copyright text then told the model to regurgitate it verbatim that would be a violation.

That does not make the model copyright violation itself.

	▲	throw646577 a year ago \| parent [-]
		This is is sort of like the argument against a blank tape levy or a tape copier tax, which is a reasonable argument in the context of the hardware. But an LLM doesn't just enable direct duplication, it (well its model) contains it. If software had a meaningful distribution cost or per-unit sale cost, a blank tape tax would be very appropriate for LLM sales. But instead OpenAI is operating a for-pay duplication service where authors don't get a share of the proceeds -- it is doing the very thing that copyright laws were designed to dissuade by giving authors a time-limited right to control the profits from reproducing copies of their work.

▲

trinsic2 a year ago | parent | prev [-]

Yea good point. whats the difference between spidering content and training a model? Its almost like access pages of contact like a search engine.. If the information is publically available?

▲

pera a year ago | parent | prev | next [-]

A product from a company is not a person. An LLM is not a brain.

If you transcode a CD to mp3 and build a business around selling these files without the author's permission you'd be in big legal problems.

Tech products that "accidentally" reproduce materials without the owners' permission (e.g. someone uploading La La Land into YouTube) have processes to remove them by simply filling a form. Can you do that with ChatGPT?

▲

lelanthran a year ago | parent | prev | next [-]

Because the law considers scale.

It's legal for you to possess a single joint. It's not legal for you to possess a warehouse of 400 tons of weed.

The line between legal and not legal is sometimes based on scale; being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it.

▲

krisoft a year ago | parent [-]

Are you describing what the law is or what you feel the law should be? Because those things are not always the same.

▲

lelanthran a year ago | parent [-]

> Are you describing what the law is or what you feel the law should be?

I am stating what is, right now.

I thought the weed example made that clear.

Let me clarify: the state of things, as they stand, is that the entire justice system, legislation and courts included, takes scale into account when looking at the line dividing "legal" from "illegal".

There is literally no defense of "If it is legal at qty x1, it is legal at any qty".

▲

krisoft a year ago | parent [-]

> I am stating what is, right now.

Excelent. Then the next question is where (in which jurisdiction) are you describing the law? And what are your sources? Not about the weed, i don’t care about that. Particularly the “being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it”.

The reason why i’m asking is because you are drawing a paralel between criminal law and (i guess?) copyright infringement. The drug posession limits in many jurisdictions are explicitly written into the law. These are not some grand principle of laws but the result of explicit legislative intent. The people writing the law wanted to punish drug peddlers without punishing end users. (Or they wanted to punish them less severly or differently.) Are the copyright limits you are thinking about similarly written down? Do you have case references one can read?

	▲	lelanthran a year ago \| parent [-]
		I made it clear in both my responses that scale matters, and that there is precedence in law, in almost all countries I can think off right now, for scale mattering. I did not make the point that there is a written law specifically for copyright violations at scale (although many jurisdictions do have exemptions at small scale written into law). I will try to clarify once again: there is no defence in law that because something is allowed at qty X1, it must be allowed at any qty. This is the defence that was originally posted that I replied to, it is the one that is not valid because courts regularly consider the scale of an activity when determining the line between allowed and not allowed.

▲

nkrisc a year ago | parent | prev | next [-]

Because AI isn’t a person.

▲

hiatus a year ago | parent | prev [-]

Is training an AI the same as a person learning something? You haven't shown that to be the case.

▲

chii a year ago | parent [-]

no i havent, but judging by the name - machine learning - i think it is the case.

	▲	yyuugg a year ago \| parent [-]
		Do you think starfish and jellyfish are fish? Judging by the name they are...

▲

criddell a year ago | parent | prev | next [-]

That might be the point. If your business model is built on reselling something you’ve built on stuff you’ve taken without payment or permission, maybe the business isn’t viable.

▲

asdff a year ago | parent | prev | next [-]

I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling” then you’ve essentially won all of humanities output, as you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access) and whatever junk outputted by the content mills is just going to be a poor summarizations of that primary information.

Other factors that help this effort of an old model + new public facing data being complete, are the idea that other forms of media like storytelling and music have already converged onto certain prevailing patters. For stories we expect a certain style of plot development and complain when its missing or not as we expect. For music most anything being listened to is lyrics no one is deeply reading into put over the same old chord progressions we’ve always had. For art there are just too few of us who are actually going out of our way to get familiar with novel art vs the vast bulk of the worlds present day artistic effort which goes towards product advertisement, which once again follows certain patterns people have been publishing in psychological journals for decades now.

In a sense we’ve already put out enough data and made enough of our world formulaic to the point where I believe we’ve set up for a perfect singularity already in terms of what can be generated for the average person who looks at a screen today. And because of that I think even a lack of any new training on such content wouldn’t hurt openai at all.

▲

andyjohnson0 a year ago | parent [-]

> I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”

I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work-around a potential court ruling is unlikely to impress a judge.

	▲	trinsic2 a year ago \| parent [-]
		Im not quite familiar with the google book project, but isnt this similar? Im pretty sure google got away with scanning copyrighted books in 2015 [0] [0]: https://www.reuters.com/article/technology/google-book-scann...

▲

zozbot234 a year ago | parent | prev | next [-]

They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.

▲

CaptainFever a year ago | parent [-]

The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)

But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)

But who knows. Maybe it can be done for more fact-like stuff.

▲

noitpmeder a year ago | parent | next [-]

Or this point, I'm sure there is more than enough publically and feely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.

	▲	CaptainFever a year ago \| parent [-]
		I would actually love it if that was true. It would reduce a lot of legal headaches for sure. But if that was true, why were previous GPT versions not as good at understanding language? I can only conclude that it's because that's not actually true. There's not enough digital public domain materials to train a LLM to understand language competently. Perhaps old texts in physical form, then? It'll cost a lot to digitize that, wouldn't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something. (A big part of this is also how insanely long copyright lasts (nearly a hundred years!) that keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.) Edit: Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts." And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision from the TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).

▲

TeMPOraL a year ago | parent | prev [-]

> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time.

Attribution at inference level is bound to work more-less the same way as humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.

RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.

	▲	CaptainFever a year ago \| parent [-]
		That sounds about right. When I ask ChatGPT about "ought implies can" for example, it cites Kant.

▲

TeMPOraL a year ago | parent | prev [-]

Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>

	▲	jprete a year ago \| parent [-]
		Being part of a public court record doesn't seem like something that would invalidate copyright.

▲

A4ET8a8uTh0 a year ago | parent | prev | next [-]

Re-training can be done, but, and it is not a small but, models already do exist and can be used locally suggesting that the milk has been spilled for too long at this point. Separately, neutering them effectively lowers their value as opposed to their non-neutered counterparts.

▲

ashoeafoot a year ago | parent | prev [-]

What about bombing? You could always smuggle dmca content in training sets hoping for a payout?

	▲	Xelynega a year ago \| parent [-]
		The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal. "well someone snuck in some DMCA content" when sharing family photos and doesn't suddenly make it legal to share that DMCA protected content with your photos...