But all content is DMCA protected. Avoiding copyrighted content means not having content as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.

The apparant loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

▲

Xelynega 2 years ago | parent | next [-]

Yes, that's how every other industry that redistributes content works.

You have to license content you want to use, you cant just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...

▲

YetAnotherNick 2 years ago | parent [-]

In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case bot is bound by ToS. The biggest and most clear is [1]. People have been scraping internet for as long as internet existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

	▲	account42 2 years ago \| parent [-]
		Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.

▲

noitpmeder 2 years ago | parent | prev [-]

It's insane to me that people don't agree that you need to require a license to train your proprietary for-profit model on someone else's work.