echelon 3 days ago

> Kinda surprised they didn’t run into model collapse problems,

This is just model distillation.

Anyone with the expertise to build a model from scratch (which DeepSeek certainly can) can do this in a careful manner.

> but they stole their training data from other people who stole their training data from data collections that arguably stole them from content creators.

Bingo.

I have no problem with pirates pirating other pirates.

Screw OpenAI's and Anthropic's closed-source models built from public data. The law should be that weights trained from non-owned sources are public domain, or that any copyright holder can sue them and demand model takedown.

Google and Meta are probably the only two AI companies that have a right to license massive amounts of training data from social media and user file uploads given that their ToSes grant them these rights. But even Meta is pirating stuff.

Even if OpenAI and Anthropic continue pirating training data and keeping the results closed, China's open source strategy will win out in the end. It erodes the crust of value that is carefully guarded by the American giants. Everyone else will be integrating open models and hacking them apart, splicing them in new ways.

rkagerer 3 days ago | parent [-]

> Google and Meta are probably the only two companies that have a right to license their training data

For the sake of someone unfamiliar... Why is that?

Did they pay teams of monkeys to generate their own, novel training data? Or gain explicit, opt-in permission from users who entrust them with their files/content?

noboostforyou 3 days ago | parent | next [-]

I'm pretty sure Meta stole a bunch of content for training by torrenting it - https://www.pcgamer.com/gaming-industry/court-documents-show...

echelon 3 days ago | parent | prev | next [-]

> For the sake of someone unfamiliar... Why is that?

I edited my comment, but basically they both own massive social media properties (YouTube, Instagram, Facebook) or file upload sites (Google Drive, Google Photos, Gmail) and their ToSes grant them these rights. You accept these terms when you use their services.

That's not great, but we are getting free services. It's in the terms.

It's a whole lot better than just scraping without permission, compensation, acknowledgement, or even notice.

To be clear, I have no problem with these models being built. But if they "steal" the data, the resultant model shouldn't be owned by anyone. It should be public domain and not allowed to be kept as a trade secret.

And it's funny that Anthropic is trying to depress our wages by training on our code. Again - I'm fine with that - I want to work faster, and I like these models and their capabilities. But Anthropic shouldn't be able to exclusively own the models they train off of us, since they didn't license or buy our data. They provided us with nothing at all.

PunchyHamster 3 days ago | parent | next [-]

Facebook stole copyrighted material well beyond its own users' data and admitted to it. It's not just "we took our users' data", it's "we literally downloaded a torrent with 81 terabytes of books and used that for training".

Google most likely did something similar, just using books they had already indexed in Google Books, and probably still seriously violated any reasonable notion of copyright.

rkagerer 3 days ago | parent | prev [-]

> You accept these terms when you use their services.

I certainly didn't*. I'd love to see litigation testing just how solid those insidious opt-in-by-default schemes are as a basis for "ownership".

If they had users explicitly opt-in with a "Yes, go ahead and train on my stuff and by the way I assert that I have all the rights to grant you the same", I'd have no problem with that, and they'd have a much stronger claim.

(*Before others inevitably disagree: I do opt-out of this stuff aggressively, and further send notice to companies from time to time that I don't agree to certain objectionable clauses of their ToS and they're welcome to close my account).

ahtihn 3 days ago | parent [-]

> and further send notice to companies from time to time that I don't agree to certain objectionable clauses of their ToS and they're welcome to close my account

And then you stopped using their service right?

rkagerer 3 days ago | parent [-]

Sometimes, if they said tough luck.

Other times they turn a blind eye and choose to provide the service (and collect my money) despite the lack of agreement to some part of their standard terms and their tacit acknowledgement that I didn't accept them. On two occasions their legal team responded and said "that's fine", and once they actually fixed their ToS.

People who didn't grow up dealing with paper contracts, where you could easily redline and send back for countersigning, don't seem to understand that you don't have to blindly say "yes" to everything a company tries to foist upon you.

ffsm8 3 days ago | parent | prev [-]

The explicit opt-in is only necessary under the GDPR, which covers a lot of data, but not a majority of it.

rkagerer 3 days ago | parent [-]

> only necessary under GDPR

It's not that simple. The EU may be the only one to have codified it, but there are centuries of case law in other jurisdictions dealing with ownership, and once the matter hits litigation, that case law might turn out to say something other than what these tech companies would like.