lyu07282 3 days ago

> This would mean there is an "official" source of all web data. LLM people can use snapshots of this

That already exists; it's called Common Crawl:

https://commoncrawl.org/

patrickhogan1 2 days ago

Common Crawl, while a massive dataset of the web, does not represent the entirety of the web.

It’s smaller than Google’s index and Google does not represent the entirety of the web either.

For LLM training purposes this may or may not matter, since it does cover a large part of the web. It’s hard to prove scientifically whether the additional data would train a better model, because no one (afaik), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy that holds the entirety of the currently accessible web (let alone dead links). Using my Google-fu, I’m often surprised at how many pages I know exist, even ones by famous authors, that just don’t appear in Google’s index, Common Crawl, or IA.

edg5000 2 days ago

Cool! I will check it out