ccgreg 7 days ago:

The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.
noosphr 7 days ago:
Yes, I've used your data myself on a number of occasions. But you are pretty much the only people who can save the web from AI bots right now. The sites I administer are drowning in bots, and the applications I build that need web data are constantly blocked. We're in the worst of all possible worlds, and the simplest way out is a middleman that scrapes gently and has the bandwidth to provide an AI-first API.
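To make "scrapes gently" concrete, here's a rough sketch of the client side, assuming a few things for illustration (the user agent name, the flat 5-second per-host delay, and example.com are all placeholders): honour robots.txt and pace requests to any one host.

    import time
    import urllib.robotparser
    import urllib.request

    AGENT = "gentle-middleman-bot"   # hypothetical user agent
    DELAY = 5.0                      # illustrative per-host pacing, in seconds

    # Fetch and parse the site's robots.txt once up front.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    last_hit = 0.0

    def fetch(url):
        global last_hit
        if not robots.can_fetch(AGENT, url):
            return None  # the site opted out; skip it
        # Wait until at least DELAY seconds have passed since the last hit.
        time.sleep(max(0.0, last_hit + DELAY - time.monotonic()))
        last_hit = time.monotonic()
        req = urllib.request.Request(url, headers={"User-Agent": AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

A real middleman would also want per-host queues, backoff on 429/503, and crawl-delay from robots.txt, but the point is that gentleness is cheap for one centralized scraper and impossible to coordinate across thousands of independent bots.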
nickpsecurity 6 days ago:
Your terms and conditions include a lot of restrictions, some of them ambiguous in how they can be interpreted. Would Common Crawl grant a "for all purposes, no restrictions" license for AI training, computer analyses, etc.? Especially given that the bad actors ignore copyrights and terms anyway, such restrictions only constrain the moral, law-abiding people.

Also, even simpler: would Common Crawl release, under a permissive license, a list of URLs that others could scrape themselves? Maybe with per-URL metadata from your crawls, such as which sites use Cloudflare or other rate limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent search operators would benefit, too.
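For what it's worth, the public CDX index at index.commoncrawl.org already gets partway there. A rough sketch of pulling per-URL records for a domain, assuming the crawl label CC-MAIN-2024-33 as a stand-in for whichever crawl is current:

    import json
    import urllib.parse
    import urllib.request

    # Placeholder crawl label; pick a real one from https://index.commoncrawl.org/
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"

    def list_urls(domain, limit=50):
        query = urllib.parse.urlencode({
            "url": f"{domain}/*",   # prefix match: every capture under the domain
            "output": "json",       # one JSON record per line
            "limit": str(limit),
        })
        with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
            for line in resp:
                rec = json.loads(line)
                # Each record carries metadata like HTTP status and MIME type;
                # yield the original URL so it can be re-fetched independently.
                yield rec["url"], rec.get("status"), rec.get("mime")

    for url, status, mime in list_urls("commoncrawl.org"):
        print(status, mime, url)

What it doesn't give you is the kind of metadata I'm asking about, like which sites sit behind Cloudflare or other limiters, and the licensing question for a bulk URL list would still need an explicit answer.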