Remix.run Logo
direwolf20 5 hours ago

Unfortunately this is the bulk of search engine work. Recursive scraping is easy in comparison, even with CAPTCHA bypassing. You either limit the index to only highly relevant sites (as Marginalia does) or you must work very hard to separate the spam from the ham. And spam in one search may be ham in another.

saltysalt 5 hours ago | parent [-]

I limit it to highly relevant curated seed sites, and don't allow public submissions. I'd rather have a small high-quality index.

You are absolutely right, it is the hardest part!