theplumber 10 hours ago

Can you keep a straight face when you say IP theft while OpenAI and Claude have their entire business based on IP theft?

hereme888 7 hours ago | parent | next [-]

I believe OP is talking at the national wealth and technology level: China stole from the U.S. (again). So the U.S. moves to protect American companies.

9 hours ago | parent | prev | next [-]
[deleted]
Levitz 6 hours ago | parent | prev | next [-]

Yes. For all the concerns there can be about IP theft by OpenAI or Claude, nobody even raises the concern when it comes to Chinese companies, since it's fully expected to be a lost cause. Has been for decades.

adamtaylor_13 8 hours ago | parent | prev [-]

This is a commonly-repeated trope. Full of all the emotional zeal of AI Doomerism, but no accompanying evidence.

NietzscheanNull 8 hours ago | parent | next [-]

How about the $1.5 billion settlement Anthropic agreed to pay authors and publishers:

https://www.nytimes.com/2025/09/05/technology/anthropic-sett...

Several consolidated cases against OpenAI:

https://www.bakerlaw.com/in-re-openai-inc-copyright-infringe...

And these plaintiffs are representative of only the best-organized and most well-funded of those who believe that these companies stole their data. Countless independent writers, artists, and other individuals whose data was ingested unknowingly and without consent lack the resources to litigate claims, but that doesn't change the fact that their copyright was violated in service of for-profit LLM/GenAI model training. It's not a trope, it's just what happened.

tick_tock_tick 5 hours ago | parent [-]

I'm not sure I understand? That case says training was explicitly legal, perfectly validating their whole business model.

The money they paid was for pirating books rather than buying them.

fer 8 hours ago | parent | prev | next [-]

No evidence?

>The court drew a line, however, when it came to the pirated books, which were downloaded without payment and kept in Anthropic’s library irrespective of whether they were used to train its LLMs.

https://www.loeb.com/en/insights/publications/2025/07/bartz-...

>We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization.

https://arxiv.org/abs/2412.06370

> They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text

https://arxiv.org/abs/2603.20957

Even if they're trained for refusal and rewording, the data is still there in the weights.

One blog post I have, which was basically the only source for a while, explaining how to boot Armbian on an obscure SBC only meant for Android, was repeated verbatim until they started improving the rewording.

user43928 8 hours ago | parent | prev | next [-]

I don't mind in the slightest that AI labs have used any public data they could get their hands on to train their models.

This includes books, the internet, or other AI models. It's all the same to me.

I find it hypocritical when AI labs complain about their models being used for training.

matheusmoreira 6 hours ago | parent [-]

I agree with you.

I also find it hypocritical when the copyright industry fails to put any effort into prosecuting these big tech companies for their so-called infringements.

It's like the industry is a shadow of its former self. The way the copyright industry used to operate, one would think these big tech CEOs would wake up with SWAT pointing guns at them while their electronics are seized, and then they'd end up in court and get hit with something ridiculous like a quadrillion dollar fine.

It actually pisses me off that it's not happening. Not because I care about copyright, but because it's extremely disrespectful towards all the previous victims of the copyright industry.

gg80 7 hours ago | parent | prev [-]

Mine is anecdotal evidence at best: I co-authored a fairly obscure book about the application of category theory to an extremely niche subject. There's basically no mention of the stuff in the book anywhere on the internet, nor in any academic publication I'm aware of. If you want to have an idea about what's in the book you have to have access to it. I couldn't remember some details of it and being lazy and slightly curious I tried asking a couple of models (one by OpenAI and one by Google): they both managed to give me extremely detailed answers based on the contents of the book. Nobody has ever asked me or any other person involved in the publication for permission to use the book in any kind of training (they may have bought the book but not the rights to reproduce it).

The funny thing is what happened when I told one of the models (the Google one) I was one of the authors and that I had never given any consent to use the book for its training and that given that it was so willing to provide any user with the contents of the book nobody would have had any reason to buy the book. The thing told me that it had done it just because I was the author of the book (apparently me asking it about the content of an obscure academic book was sufficient to make it statistically plausible that I was one of the two people who had read the book, me and my co-author, excluding the editor a priori). It swore it would have never given that information to any other user.

I doubt that anyone could ever deny that LLMs are incredible tools that have incredible value. But denying that they have been made possible only thanks to egregious acts of piracy is disingenuous.