nickpsecurity 6 days ago
Your terms and conditions include a lot of restrictions, some of which are ambiguous in how they can be interpreted. Would Common Crawl offer a "for all purposes, no restrictions" license if the use is AI training, computer analyses, etc.? Especially given that bad actors ignore copyrights and terms, while such restrictions only affect moral, law-abiding people? Also, even simpler, would Common Crawl release, under a permissive license, a list of URLs that others could scrape themselves? Maybe with per-URL metadata from your crawls, such as which sites use Cloudflare or other limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent search operators would benefit, too.
ccgreg 6 days ago
Common Crawl doesn't own the content in its crawl, so no, our terms of use do not grant anyone permission to ignore the actual content owner's license. We carefully preserve robots permissions expressed in robots.txt, in HTTP headers, and in HTML meta tags. We do publish two different URL indexes, if you want to recrawl for some reason.