Interesting. Two key quotes:

> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.

> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.

Tens of thousands of violations at $2,500 each would amount to tens of millions of dollars in damages. I am not familiar with this field; does anyone have a sense of how the total cost of retraining (without these alleged DMCA violations) would compare to these damages?
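To make the scale concrete (back-of-the-envelope only; the violation counts below are made up, not from the filings):

    # Statutory damages at $2,500 per alleged violation, per the complaints.
    # The violation counts are hypothetical, for a sense of scale only.
    PER_VIOLATION = 2_500  # dollars

    for violations in (10_000, 50_000, 100_000):
        total = violations * PER_VIOLATION
        print(f"{violations:,} violations -> ${total:,}")

    # 10,000 violations -> $25,000,000
    # 50,000 violations -> $125,000,000
    # 100,000 violations -> $250,000,000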
▲ | jsheard 7 hours ago

It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
▲ | Xelynega 7 hours ago

I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.
▲ | CuriouslyC 6 hours ago

They can't stop doing things some people don't like (people who also won't stop doing things other people don't like). The legality of the claims is questionable, which is why most are getting thrown out, but we'll see if this narrow approach works out. I'm sure there are also a number of easy technical ways to "include" the metadata while mostly ignoring it during training that would skirt the letter of the law if needed.
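For example (purely illustrative, not anything OpenAI has described): keep the copyright-management info in each training document, but mask those token positions out of the loss, so the metadata is technically "included" while contributing nothing to what the model learns:

    # Hypothetical sketch: CMI tokens stay in the input, but a loss mask
    # zeroes them out so the model is never trained on (or to strip) them.

    def build_loss_mask(tokens, metadata_spans):
        """Return 1 for positions that count toward the training loss, 0 otherwise."""
        mask = [1] * len(tokens)
        for start, end in metadata_spans:
            for i in range(start, end):
                mask[i] = 0
        return mask

    tokens = ["<meta>", "Author:", "J.", "Doe", "(c)", "</meta>", "The", "story", "begins"]
    print(build_loss_mask(tokens, metadata_spans=[(0, 6)]))
    # [0, 0, 0, 0, 0, 0, 1, 1, 1]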
▲ | Xelynega 4 hours ago

If we really want to be technical, in common law systems anything is legal as long as the highest court to consider it decides it's legal. I guess I should have used the phrase "common-sense stealing in any other context" to be more precise?
▲ | asdff 7 hours ago

I wonder if they can say something like "we aren't scraping your protected content, we are merely scraping this old model we don't maintain anymore, and it happened to have protected content in it from before the ruling." Then you've essentially won all of humanity's output, since you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills put out is just going to be poor summarizations of that primary information.

Other factors help this effort of an old model plus new public-facing data being complete: other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we've always had. For art, too few of us actually go out of our way to get familiar with novel art, versus the vast bulk of the world's present-day artistic effort, which goes toward product advertisement, which once again follows patterns people have been publishing in psychological journals for decades now.

In a sense we've already put out enough data and made enough of our world formulaic that I believe we've set up a perfect singularity already, in terms of what can be generated for the average person who looks at a screen today. And because of that, I think even a lack of any new training on such content wouldn't hurt OpenAI at all.
▲ | zozbot234 7 hours ago

They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good-faith intent.
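A minimal sketch of what (1) and (2) could look like in a data pipeline (the record layout and field names are invented for illustration, not anything OpenAI has published):

    # Hypothetical corpus record carrying license and attribution metadata
    # alongside the text; all field names are invented for this example.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CorpusDoc:
        text: str
        license: str                      # e.g. "public-domain", "CC-BY-4.0"
        author: Optional[str] = None
        source_url: Optional[str] = None

    def base_language_subset(docs):
        """(1) Train basic language use only on public domain material."""
        return [d for d in docs if d.license == "public-domain"]

    def attribution_for(doc):
        """(2) Preserved metadata the model/UI could surface at inference."""
        if doc.author or doc.source_url:
            return f"Source: {doc.author or 'unknown author'} ({doc.source_url or 'no URL'})"
        return "Source unknown"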
▲ | CaptainFever 7 hours ago

The latter is possible with RAG solutions like ChatGPT Search, which do already provide sources! :) But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc., which is kind of too fundamental to attribute, IMO. (Attribution: Humanity) But who knows. Maybe it can be done for more fact-like stuff.
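Roughly why retrieval can cite sources while plain generation can't: the citation travels with the retrieved chunk. A toy sketch (made-up corpus and URLs, with crude keyword overlap standing in for a real vector search; not how ChatGPT Search is actually implemented):

    # Toy RAG illustration: the answer text and its source come back together.
    CORPUS = [
        {"text": "The DMCA provides statutory damages per violation",
         "source": "https://example.com/dmca-explainer"},
        {"text": "Pop songs reuse the same few chord progressions",
         "source": "https://example.com/pop-harmony"},
    ]

    def retrieve(query):
        # Crude word-overlap score standing in for a real vector search.
        q = set(query.lower().split())
        return max(CORPUS, key=lambda doc: len(q & set(doc["text"].lower().split())))

    doc = retrieve("statutory damages under the DMCA")
    print(f'{doc["text"]}. [source: {doc["source"]}]')
    # The DMCA provides statutory damages per violation. [source: https://example.com/dmca-explainer]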
▲ | TeMPOraL 6 hours ago

> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time. Attribution at the inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?..." Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something. RAG obviously can work for this, as can other solutions involving retrieving, finding or confirming sources. That's just like a human actually looking up the source when citing something, and it has similar caveats and costs.
▲ | TeMPOraL 6 hours ago

Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
▲ | jprete 5 hours ago

Being part of a public court record doesn't seem like something that would invalidate copyright.
▲ | ashoeafoot 7 hours ago

What about bombing? You could always smuggle DMCA-protected content into training sets hoping for a payout?
▲ | Xelynega 7 hours ago

The onus is on the party collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal. Saying "well, someone snuck in some DMCA content" when sharing family photos doesn't suddenly make it legal to share that DMCA-protected content along with your photos...
▲ | sandworm101 7 hours ago

But all content is DMCA-protected. Avoiding copyrighted content means having no content at all, as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield. The apparent loophole is the gap between copyrighted work and copyrighted work that is also registered, but registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
▲ | Xelynega 4 hours ago

Yes, that's how every other industry that redistributes content works. You have to license content you want to use; you can't just use it for free because it's on the internet. Netflix doesn't just start hosting shows and hope they don't get a copyright suit...
▲ | A4ET8a8uTh0 7 hours ago

Re-training can be done, but, and it is not a small but, models already exist and can be used locally, suggesting the milk has long since been spilled at this point. Separately, neutering them effectively lowers their value compared to their non-neutered counterparts.