leobg 6 days ago

This was a great write-up!

Didn't you run into Cloudflare blocks? Many sites use things like browser fingerprinting. I'd imagine this would be an issue with news sites in particular, as many of them show the full content only to Googlebot and not to anyone else, which I have long thought of as an underappreciated moat that Google has in the search market. I was surprised that this topic wasn't mentioned at all in your article. Was it not an issue, or did you just prefer to leave it out?

You also mentioned nothing about URL de-duplication: things like "trailing slash or no trailing slash", "query params or no query params", "www or no www". Did you have your crawlers just follow all URLs as they encountered them, and handle duplication only at the content level (e.g. using trigrams)? It sounds like that would be wasteful, as you might end up making requests to 2x or more the number of URLs you'd need to.
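
To make the URL-level idea concrete, here's a rough sketch of the kind of normalization I mean, using only Python's standard library. The function names and the TRACKING_PARAMS list are made up for illustration, and the rules themselves are assumptions: stripping "www." or a trailing slash is not safe on every site, since some servers do return different content for each variant.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as tracking noise.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid",
}


def normalize_url(url: str) -> str:
    """Map trivially different spellings of a URL to one canonical key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Assumption: "www.example.com" and "example.com" serve the same pages.
    if host.startswith("www."):
        host = host[4:]

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    is_default = (scheme, port) in {("http", 80), ("https", 443)}
    if port and not is_default:
        host = f"{host}:{port}"

    # Assumption: a trailing slash does not change the resource.
    path = parts.path.rstrip("/") or "/"

    # Drop tracking parameters and sort the rest so order does not matter.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k not in TRACKING_PARAMS)
    query = urlencode(kept)

    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, query, ""))


seen: set[str] = set()


def should_fetch(url: str) -> bool:
    """Return True the first time a normalized URL is seen, False afterwards."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With something like this in front of the fetch queue, "https://www.example.com/news/" and "https://example.com/news" collapse to one request, and content-level de-duplication only has to catch what's left.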

Thanks.