scotty79 7 days ago

How do you even build a search index today when websites barely link to each other?

Nowadays the bulk of linking goes to e-commerce sites (Amazon) from content farms (Reddit), and all those sites are submitted directly to Google. I don't think a crawlable internet exists anymore.

NitpickLawyer 7 days ago | parent | next

> How do you even build a search index today

You can start with seeds like Common Crawl and go from there. You can also get DNS records from various providers. Then there are SSL certificate transparency logs you can crawl. Plenty of sources, if you have funding (search by itself, without ads sponsoring it, might be a net loss, except for some niche uses like Kagi?)
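For the CT-log route, here's a minimal sketch (mine, not a reference implementation) of tailing a Certificate Transparency log's RFC 6962 get-entries endpoint and pulling hostnames out of the logged certificates. The log URL is just an example shard — the current log list lives at https://www.gstatic.com/ct/log_list/v3/log_list.json — and a real crawler would page through millions of entries and decode precert entries too.

    # Sketch: discover hostnames by reading a Certificate Transparency log
    # via the RFC 6962 get-entries API. The log shard below is an example.
    import base64
    import struct

    import requests
    from cryptography import x509

    LOG = "https://ct.googleapis.com/logs/us1/argon2025h1"  # example shard

    def leaf_domains(leaf_input_b64):
        """Extract SAN hostnames from one MerkleTreeLeaf (x509 entries only)."""
        leaf = base64.b64decode(leaf_input_b64)
        # MerkleTreeLeaf: version(1) + leaf_type(1) + timestamp(8) + entry_type(2)
        (entry_type,) = struct.unpack(">H", leaf[10:12])
        if entry_type != 0:        # 0 = x509_entry; precerts need extra decoding
            return []
        cert_len = int.from_bytes(leaf[12:15], "big")  # 24-bit length prefix
        try:
            cert = x509.load_der_x509_certificate(leaf[15:15 + cert_len])
            san = cert.extensions.get_extension_for_class(
                x509.SubjectAlternativeName).value
            return san.get_values_for_type(x509.DNSName)
        except (ValueError, x509.ExtensionNotFound):
            return []

    resp = requests.get(LOG + "/ct/v1/get-entries",
                        params={"start": 0, "end": 31}, timeout=30)
    for entry in resp.json()["entries"]:
        for name in leaf_domains(entry["leaf_input"]):
            print(name)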

loa_in_ 7 days ago | parent | prev | next

It isn't impossible nowadays to enumerate domain names using DNS data and score them based on the content they serve. Isn't that what we really want as users? Scoring based not on proxies for relevance like referral count, but on viewable content?
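As a toy illustration of what content-based scoring could look like (my own crude heuristic, not a production relevance metric): fetch a domain's homepage and score it by how much distinct readable text it actually serves.

    # Toy sketch: score a domain on its viewable content rather than inbound
    # links. Lexical diversity of the homepage text is an illustrative
    # stand-in for a real quality signal.
    import re

    import requests

    def content_score(domain):
        try:
            resp = requests.get(f"https://{domain}", timeout=10)
        except requests.RequestException:
            return 0.0
        # Crude HTML stripping; a real crawler would use a proper parser.
        text = re.sub(r"<script.*?</script>|<style.*?</style>", " ",
                      resp.text, flags=re.S | re.I)
        text = re.sub(r"<[^>]+>", " ", text)
        words = re.findall(r"[a-z]{3,}", text.lower())
        return len(set(words)) / (len(words) + 1)  # unique-word ratio

    print(content_score("example.com"))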

guillem_lefait 7 days ago | parent

It's possible indeed, as I'm doing it for another reason (monitoring sovereignty).

You can ask ICANN [0] for access to gTLD domain list files (if you have a legitimate reason to do so). Once you are granted access to a gTLD, you can download a compressed zone file with a line per <domain, nameserver> pair.

[0] https://czds.icann.org/home
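For illustration, a minimal sketch of pulling registered domains out of one of those files once CZDS has approved access; the filename is hypothetical, and I'm assuming the usual gzipped zone-file dump with NS records.

    # Sketch: enumerate registered domains from a downloaded CZDS zone file.
    # "com.txt.gz" is a hypothetical filename for the gzipped dump.
    import gzip

    def domains_from_zone(path):
        """Yield unique domain names from the NS records of a zone file."""
        seen = set()
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                # Zone-file NS records look like:
                #   example.com. 86400 in ns ns1.example.net.
                if len(parts) >= 5 and parts[3].lower() == "ns":
                    name = parts[0].rstrip(".").lower()
                    if name not in seen:
                        seen.add(name)
                        yield name

    for domain in domains_from_zone("com.txt.gz"):
        print(domain)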

gostsamo 7 days ago | parent | prev

You are making a broad generalization, and even that rests on the assumption that the page-ranking algorithm is the only possible way to do it.

scotty79 7 days ago | parent

I make no assumption beyond this one: to index a page, you need to know its address. If the site's author won't give it to you because you are not Google, and no known website links to it, then you have no way of finding it. You can't build anything independently; you have to go to Google.