| ▲ | jsheard 2 hours ago |
> I figured that they have found an (automated) way to imitate Googlebot really well. If a site (or the WAF in front of it) knows what it's doing, then you'll never be able to pass as Googlebot, period, because the canonical verification method is a reverse-then-forward DNS lookup dance which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
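For reference, that DNS dance can be sketched in a few lines of Python using only the standard `socket` module. This is a minimal illustration, not a production WAF rule; the hostname suffixes follow Google's documented verification procedure:

```python
import socket

# Suffixes Google documents for verified Googlebot reverse-DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(host: str) -> bool:
    """Pure check: does a reverse-DNS hostname belong to Google?"""
    return host.endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the hostname and require it to map back to the same IP.
    A spoofed User-Agent fails at step 1 or step 3."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # 1. reverse lookup
    except socket.herror:
        return False
    if not hostname_is_google(host):                 # 2. domain check
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # 3. forward lookup
    except socket.gaierror:
        return False
    return ip in addrs                               # 4. must round-trip
```

The forward lookup in step 3 is what closes the loop: an attacker controls the reverse DNS of their own IP range and can name it whatever they like, but they can't make `googlebot.com` resolve back to their address.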
| ▲ | xurukefi 2 hours ago | parent [-] |
There are ways to work around this. I've just tested it: I used the URL Inspection tool in Google Search Console to fetch a URL on my website which I'd configured to redirect to a paywalled news article. It turns out the crawler follows that redirect and hands me the full source code of the redirected site, with no paywall. Automating that at the scale of archive.today is maybe a bit insane, but I figure they do something along those lines. It's a perfect imitation of Googlebot because it literally is Googlebot.