| ▲ | rvz 8 hours ago |
> Among the works we hold copyrights over is Sam Williams and Richard Stallman's Free as in Freedom: Richard Stallman's Crusade for Free Software, which was found in datasets used by Anthropic as training inputs for their LLMs.

This is the reason AI companies won't let anyone inspect what was in the training set. It turns out the suspicions of many copyright holders (including the FSF) were true, of course. Anthropic and others will never admit it, which is why they wanted to settle rather than risk going to trial. AI boosters will keep gaslighting copyright holders with nonsense like "it only scraped the links, so the AI didn't train directly on your content", "the AI can't see like humans, it only sees numbers, binary, or digits", or "the AI didn't reproduce 100% of the content exactly, just like humans do when tracing from memory". They will never share the dataset used to train Claude, even if it was trained on AGPLv3 code.
| ▲ | impossiblefork 5 hours ago |
There are already legal requirements in the EU that you must publish what goes into your training set. Apparently this information must be published by August 2 next year.
| ▲ | zelphirkalt 6 hours ago |
They simply have far too much incentive to train on anything they can get their hands on. They are running businesses that have racked up billions in losses so far. Someone somewhere is probably being told to feed the monster anything they can get and not to document it, threatened with an NDA and personal financial ruin if proof of it ever came out. Opaque processes act as a shield, like they do in so many other businesses.