rpearcea 11 hours ago

Interesting, but 1.3 million pages is somewhat limited. They seem to have done a good job indexing Wikipedia. I'm curious, why not scan the full IPv4 address space and index the main page of every website you find?

Tiberium 11 hours ago | parent | next [-]

You won't be able to scan most websites this way, because most servers expect you to also pass a valid hostname. However, you can use domain lists as an initial seed, for example https://purecrawl.com/en/download/domains (or https://domains-monitor.com/, which is paid but has more domains). That shouldn't be too bad, but you'll have to ingest terabytes of spammy/low-quality content.
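
To illustrate the hostname point, here is a minimal sketch of the request a crawler would send. The helper name `build_request` and the example address and domain are made up for illustration; nothing is sent over the network.

```python
def build_request(hostname: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request; the Host header selects the site."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


# Scanning by address alone: the only "hostname" available is the IP itself
# (203.0.113.10 is a documentation address, used here as a placeholder).
by_ip = build_request("203.0.113.10")

# Crawling from a domain list: the real name goes into the Host header.
by_name = build_request("example.com")
```

A server hosting many sites on one address typically uses the Host value (and, for HTTPS, the name sent as SNI during the TLS handshake) to decide which site to serve, so the `by_ip` request would most likely get a default page or an error.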

Brybry 11 hours ago | parent [-]

Wouldn't certificate transparency logs be a good way to collect most active domains?
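
A rough sketch of what that could look like once the names are out of the logs. It assumes you already have an iterable of DNS names taken from logged certificates (how you fetch them is not shown), and `domains_from_cert_names` is a hypothetical helper, not part of any CT tooling.

```python
from typing import Iterable


def domains_from_cert_names(names: Iterable[str]) -> list[str]:
    """Normalize certificate DNS names into a deduplicated list of crawl seeds."""
    seen: set[str] = set()
    seeds: list[str] = []
    for raw in names:
        # Lowercase and drop any trailing dot so equal names compare equal.
        name = raw.strip().lower().rstrip(".")
        # Wildcard entries like "*.example.com" only tell us the parent name.
        if name.startswith("*."):
            name = name[2:]
        # Skip empty strings, bare labels, and names we have already seen.
        if not name or "." not in name or name in seen:
            continue
        seen.add(name)
        seeds.append(name)
    return seeds


seeds = domains_from_cert_names(
    ["*.Example.com", "example.com", "www.example.org.", ""]
)
```

This only cleans up names; a certificate existing for a domain doesn't mean the site is still up, so you'd still have to probe each seed.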

Tiberium 9 hours ago | parent [-]

Yeah, I forgot about these :)

KomoD 11 hours ago | parent | prev [-]

For being such a small index, it's sooo slow to search

slater 11 hours ago | parent [-]

Probably experiencing the HN hug