patrickhogan1 2 days ago

Common Crawl, while a massive dataset of the web, does not represent the entirety of the web.

It’s smaller than Google’s index, and Google does not represent the entirety of the web either.

For LLM training purposes this may or may not matter, since it does cover a large amount of the web. It’s hard to prove scientifically whether the additional data would train a better model, because no one (afaik), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy of the entirety of the currently accessible web (let alone dead links). I’m often surprised, using Google-fu, at how many pages I know exist, even from famous authors, that just don’t appear in Google’s index, Common Crawl, or IA.
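For what it's worth, Common Crawl exposes a public CDX-style index API at index.commoncrawl.org, so you can check whether a specific page was ever captured. A minimal sketch (the crawl label "CC-MAIN-2024-10" is just an example; the current labels are listed on the index site, and a 404 from the API is assumed here to mean "no captures"):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Example crawl label; pick a current one from https://index.commoncrawl.org/
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def index_query_url(page_url: str) -> str:
    """Build the CDX query URL asking for JSON output for one page."""
    return INDEX + "?" + urlencode({"url": page_url, "output": "json"})

def parse_cdx_lines(body: str) -> list:
    """The API returns one JSON object per line; parse each capture."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

def in_common_crawl(page_url: str) -> bool:
    """True if at least one capture of page_url exists in this crawl."""
    try:
        with urlopen(index_query_url(page_url), timeout=30) as resp:
            return len(parse_cdx_lines(resp.read().decode())) > 0
    except Exception:
        # Assumption: the API answers 404 when there are no captures
        return False
```

Running `in_common_crawl` over a list of URLs you know exist would give a rough per-site hit rate for the gap the parent describes.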

schoen 2 days ago

Is there any way to find patterns in what doesn't make it into Common Crawl, and perhaps help them become more comprehensive?

Hopefully it's not people intentionally allowing the Google crawler while intentionally excluding Common Crawl's crawler (CCBot) with robots.txt?