| ▲ | edg5000 3 days ago |
| Can we build a blockchain/P2P-based web crawler that creates snapshots of the entire web with high integrity (peer verification)? Already-crawled pages would be exchanged through bulk transfer between peers. This would mean there is an "official" source of all web data, and LLM people could use snapshots of it. It would hopefully reduce the number of ill-behaved crawlers, so over time we would see fewer draconian anti-bot measures on websites, in turn making the web easier to crawl. Does something like this exist? It would be so awesome. It would also allow people to run a search engine at home. |
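A rough sketch of how that peer-verification step might work, purely illustrative: a snapshot of a URL is only accepted as "official" once some quorum of independent peers reports the same content hash for it. The names, quorum size, and hashing scheme below are assumptions, and a real system would have to normalize dynamic page content before hashing or honest peers would rarely agree.

    # Hypothetical sketch of peer verification: accept a page snapshot once
    # enough independent peers report the same content hash for the same URL.
    import hashlib
    from collections import Counter
    from dataclasses import dataclass

    QUORUM = 3  # illustrative: number of agreeing peers required

    @dataclass(frozen=True)
    class CrawlReport:
        peer_id: str
        url: str
        content_hash: str  # SHA-256 of the (normalized) page body

    def hash_page(body: bytes) -> str:
        """Hash the page body; a real system would strip dynamic parts first."""
        return hashlib.sha256(body).hexdigest()

    def verified_snapshot(reports: list[CrawlReport]) -> str | None:
        """Return the content hash agreed on by at least QUORUM distinct peers, if any."""
        votes = Counter()
        seen_peers = set()
        for r in reports:
            if r.peer_id in seen_peers:
                continue  # one vote per peer
            seen_peers.add(r.peer_id)
            votes[r.content_hash] += 1
        winner, count = votes.most_common(1)[0] if votes else (None, 0)
        return winner if count >= QUORUM else None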
|
| ▲ | lyu07282 3 days ago | parent | next [-] |
| > This would mean there is an "official" source of all web data. LLM people can use snapshots of this |
| That already exists; it's called Common Crawl: https://commoncrawl.org/ |
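For anyone who wants to poke at it, Common Crawl exposes a public CDX index server that can be queried per URL. A small example sketch; the collection name CC-MAIN-2024-33 is just one crawl and the current list is at https://index.commoncrawl.org/:

    # Query the Common Crawl index for captures of a URL (collection name is
    # an example; pick a current one from https://index.commoncrawl.org/).
    import json
    import urllib.error
    import urllib.parse
    import urllib.request

    def cc_index_lookup(url: str, collection: str = "CC-MAIN-2024-33"):
        """Return index records (one JSON object per capture), or [] if not indexed."""
        query = urllib.parse.urlencode({"url": url, "output": "json"})
        endpoint = f"https://index.commoncrawl.org/{collection}-index?{query}"
        try:
            with urllib.request.urlopen(endpoint) as resp:
                return [json.loads(line) for line in resp.read().splitlines() if line]
        except urllib.error.HTTPError:
            return []  # the index returns an error status for unknown URLs

    for record in cc_index_lookup("commoncrawl.org"):
        # Each record points into a WARC file (filename/offset/length) in the
        # public Common Crawl storage, so a single capture can be fetched directly.
        print(record.get("timestamp"), record.get("status"), record.get("url"))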
| |
▲ | patrickhogan1 2 days ago | parent | next [-] |
| Common Crawl, while a massive dataset of the web, does not represent the entirety of the web. It's smaller than Google's index, and Google's index doesn't cover the entire web either. For LLM training purposes this may or may not matter, since Common Crawl does hold a large share of the web. It's hard to prove scientifically whether the additional data would train a better model, because no one (afaik) has a copy of the entire currently accessible web, let alone dead links: not Google, not Common Crawl, not Facebook, not the Internet Archive. I'm often surprised, using Google-fu, at how many pages I know exist, even by famous authors, that just don't appear in Google's index, Common Crawl, or IA. |
▲ | schoen 2 days ago | parent [-] |
| Is there any way to find patterns in what doesn't make it into Common Crawl, and perhaps help them become more comprehensive? Hopefully it's not people intentionally allowing the Google crawler and intentionally excluding Common Crawl with robots.txt? |
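One way to spot-check that hypothesis for a given site is to ask its robots.txt whether Googlebot may fetch a path while CCBot (Common Crawl's crawler) may not. This is only illustrative, and robots.txt is just one of several possible reasons for coverage gaps:

    # Check whether a site's robots.txt allows Googlebot but blocks CCBot.
    import urllib.robotparser

    def crawler_access(site: str, path: str = "/") -> dict[str, bool]:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{site}/robots.txt")
        rp.read()
        return {agent: rp.can_fetch(agent, f"https://{site}{path}")
                for agent in ("Googlebot", "CCBot")}

    access = crawler_access("example.com")  # hypothetical target site
    if access["Googlebot"] and not access["CCBot"]:
        print("Google allowed, Common Crawl excluded via robots.txt")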
| |
▲ | edg5000 2 days ago | parent | prev [-] |
| Cool! I will check it out |
|
|
| ▲ | bayindirh 3 days ago | parent | prev [-] |
| Why would I spend time and resources to feed a machine which wastes more resources to hallucinate fiction from data it ingested? For digital preservation? We may discuss. For an LLM? Haha, no. No, thank you. |