andai 7 hours ago
Have you ever looked at Common Crawl dumps? I did a bit of data mining and holy cow is 99.99% of the web crap. Spam, porn, ads, flame wars, random blogs by angsty teens... I understand it has historical and cultural value (and maybe literary value, in a Douglas Coupland kind of way), but for my purposes, there was very little there that I considered of interest. Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop. Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests, e.g. the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)