echelon | 3 days ago
> Kinda surprised they didn’t run into model collapse problems,

This is just model distillation. Anyone with the expertise to build a model from scratch (which DeepSeek certainly can) can do this in a careful manner.

> but they stole their training data from other people who stole their training data from data collections that arguably stole them from content creators.

Bingo. I have no problem with pirates pirating other pirates. Screw OpenAI's and Anthropic's closed-source models built from public data.

The law should be that weights trained on non-owned sources are public domain, or that any copyright holder can sue and demand a model takedown.

Google and Meta are probably the only two AI companies that have a right to license massive amounts of training data from social media and user file uploads, given that their ToSes grant them these rights. But even Meta is pirating stuff.

Even if OpenAI and Anthropic continue pirating training data and keeping the results closed, China's open-source strategy will win out in the end. It erodes the crust of value that is carefully guarded by the American giants. Everyone else will be integrating open models, hacking them apart, and splicing them together in new ways.
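For anyone unfamiliar with the term: distillation here just means training the new model to imitate a teacher model's output distribution, rather than training on recursively self-generated text, which is why it sidesteps the collapse feedback loop. A minimal PyTorch-style sketch of the idea (toy linear models and made-up sizes, not DeepSeek's actual setup):

    # Minimal sketch of soft-label knowledge distillation (PyTorch).
    # Toy stand-in models and hyperparameters -- illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab, T, alpha = 1000, 2.0, 0.5           # toy vocab size, temperature, loss mix

    teacher = nn.Linear(64, vocab)             # stand-ins for real transformer LMs
    student = nn.Linear(64, vocab)
    for p in teacher.parameters():             # teacher is frozen
        p.requires_grad_(False)

    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

    x = torch.randn(32, 64)                    # toy batch of input features
    hard_labels = torch.randint(0, vocab, (32,))

    t_logits = teacher(x)
    s_logits = student(x)

    # KL divergence between temperature-softened teacher and student
    # distributions (scaled by T^2, per the classic Hinton et al. 2015
    # recipe), blended with ordinary cross-entropy on hard labels.
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(s_logits, hard_labels)
    loss = alpha * kd + (1 - alpha) * ce

    opt.zero_grad()
    loss.backward()
    opt.step()

When the teacher is only reachable through an API you usually get sampled completions rather than logits, in which case the KD term degenerates to plain cross-entropy on the teacher's outputs, but the overall structure is the same.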
rkagerer | 3 days ago
> Google and Meta are probably the only two companies that have a right to license their training data

For the sake of someone unfamiliar... Why is that? Did they pay teams of monkeys to generate their own, novel training data? Or gain explicit, opt-in permission from users who entrust them with their files/content?