duskwuff 4 days ago

> A bot told me they offer downloads of the underlying WARC files but I could not find it

The "bot" is wrong. Most of the crawl data used by the Internet Archive, particularly the Alexa crawls, isn't publicly accessible. (This is because some of it includes archived pages that the site owner has since had suppressed, and removing those pages from the archived crawl data isn't practical.)

https://archive.org/details/alexacrawls

Common Crawl data is public, but less comprehensive than IA - https://commoncrawl.org/
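
If you want actual WARC records from Common Crawl, here is a rough sketch of how I understand the lookup to work: query the CDX index for a URL, then request the byte range it reports from the data host. The hostnames and the crawl ID in the usage note are my assumptions (check the Common Crawl site for the current crawl list), and the function names are made up for illustration. It only builds the requests and does not send them.

```python
from urllib.parse import urlencode

# Assumed endpoints, not taken from the comment above; verify before relying on them.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id: str, page_url: str) -> str:
    """Build a CDX index lookup URL for one page in one crawl (hypothetical helper)."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def warc_record_request(filename: str, offset: int, length: int) -> tuple[str, dict[str, str]]:
    """Return the URL and Range header selecting one record inside a WARC file.

    filename, offset, and length are expected to come from an index result;
    the index returns offset and length as strings, so convert them with int().
    """
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return f"{DATA_HOST}/{filename}", {"Range": f"bytes={offset}-{last_byte}"}
```

For example, `index_query_url("CC-MAIN-2024-10", "example.com")` gives a URL you can fetch with any HTTP client; each line of the response should be a JSON object whose `filename`, `offset`, and `length` fields feed into `warc_record_request`. The returned slice is a single gzipped WARC record, so you never need to download a whole crawl file.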