xurukefi 2 hours ago

Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.

jsheard 2 hours ago | parent | next [-]

> I figured that they have found an (automated) way to imitate Googlebot really well.

If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a reverse-then-forward DNS lookup, which can only succeed if the request came from one of Googlebot's dedicated IP addresses. Bingbot is the same.
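
A minimal sketch of that check, assuming plain Python with just the standard library (the suffixes are the reverse-DNS hostnames Google documents for Googlebot):

    import socket

    # Hostname suffixes Google documents for genuine Googlebot reverse-DNS names.
    GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

    def is_real_googlebot(ip: str) -> bool:
        """Reverse-then-forward check: the requesting IP must resolve to a
        Google-owned hostname, and that hostname must resolve back to the
        same IP. A spoofed User-Agent from any other IP fails this."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)             # reverse lookup
        except OSError:
            return False
        if not hostname.endswith(GOOGLEBOT_SUFFIXES):
            return False
        try:
            _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward lookup
        except OSError:
            return False
        return ip in forward_ips

    # A crawler claiming "Googlebot" in its User-Agent from a non-Google IP:
    print(is_real_googlebot("203.0.113.7"))  # False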

xurukefi 2 hours ago | parent [-]

There are ways to work around this. I just tested it: I used the URL inspection tool in Google Search Console to fetch a URL on my website that I'd configured to redirect to a paywalled news article. Turns out the crawler follows that redirect and gives me the full source code of the target page, without any paywall.

That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along these lines. It's a perfect imitation of Googlebot because it is literally Googlebot.
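
A minimal sketch of the redirect half of this, assuming a Flask app; the paths and target URL are stand-ins, not my actual setup:

    # A URL on my own (Search Console-verified) site that 302-redirects to
    # the paywalled article. Flask and the URLs are illustration only.
    from flask import Flask, redirect

    app = Flask(__name__)

    @app.route("/inspect-me")
    def inspect_me():
        # Point the URL inspection tool at /inspect-me; the crawler follows
        # this 302 and "View crawled page" shows the target's full HTML.
        return redirect("https://news.example.com/paywalled-article", code=302)

    if __name__ == "__main__":
        app.run()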

jsheard 2 hours ago | parent | next [-]

I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P

Presumably they are just matching on *Google* and calling it a day.
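Something like this, presumably (the check is assumed for illustration, not pulled from any real WAF config):

    # Anything with "google" in the User-Agent gets the paywall-free page,
    # so Google-InspectionTool sails through right alongside real Googlebot.
    def should_drop_paywall(user_agent: str) -> bool:
        return "google" in user_agent.lower()

    print(should_drop_paywall(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
    print(should_drop_paywall(
        "Mozilla/5.0 (compatible; Google-InspectionTool/1.0)"))                        # True too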

xurukefi an hour ago | parent [-]

Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine-looking User-Agent header.

Aurornis an hour ago | parent | prev [-]

> which I've configured to redirect to a paywalled news article.

Which specific site with a paywall?

elzbardico 2 hours ago | parent | prev | next [-]

> which is, of course, ridiculous.

Why? In the world of web scraping this is pretty common.

xurukefi 2 hours ago | parent [-]

Because it works too reliably. Imagine what that would entail: managing thousands of accounts, making sure the account details are stripped perfectly from every archived page, and risking an account every time a website changes its code even slightly. It would constantly break and be an absolute nightmare to maintain. I've personally never seen it fail on a paywalled news article; archive.today has given me a clean, non-paywalled version every single time.

Maybe they use accounts for some special sites, but there is definitely some automated, generic magic happening that manages to bypass the paywalls of news outlets. Probably something Googlebot-related, because those websites usually serve Google their news pages without a paywall, presumably for SEO reasons.

mikkupikku 2 hours ago | parent [-]

Using two or more accounts could help you automatically strip account details.
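
Rough sketch of what I mean, assuming you already have the same article fetched through two different accounts: keep only what both copies agree on, since anything that differs is where the account details live.

    import difflib

    def strip_account_details(page_a: str, page_b: str) -> str:
        """Keep only the lines that are identical in both accounts' copies
        of the page; usernames, session tokens and other per-account markup
        end up in the differing regions and get dropped."""
        a_lines = page_a.splitlines(keepends=True)
        b_lines = page_b.splitlines(keepends=True)
        matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines)
        shared = []
        for tag, a_start, a_end, _, _ in matcher.get_opcodes():
            if tag == "equal":
                shared.extend(a_lines[a_start:a_end])
        return "".join(shared)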

xurukefi 2 hours ago | parent [-]

That's actually a really neat idea.

Aurornis 2 hours ago | parent | prev | next [-]

> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

The curious part is that they allow scraping arbitrary pages on demand. So a publisher could submit a lot of requests to archive its own pages and check whether they all come from a single account or a small subset of accounts.

I hope they haven't been stealing cookies from actual users through a botnet or something.

xurukefi 2 hours ago | parent [-]

Exactly. If I were an admin of a popular news website, I would try to archive some articles and look at the access logs in the backend. It can't be too hard to figure out.

tonymet 2 hours ago | parent | prev | next [-]

I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.
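
Rough sketch of the residential-proxy part (the proxy endpoint, credentials and User-Agent below are placeholders, not a real provider):

    import requests

    # Placeholder residential proxy endpoint; real providers hand you
    # per-session credentials and rotate the exit IP for you.
    RESIDENTIAL_PROXY = "http://user:pass@residential-proxy.example.net:8000"

    def fetch(url: str) -> str:
        resp = requests.get(
            url,
            proxies={"http": RESIDENTIAL_PROXY, "https": RESIDENTIAL_PROXY},
            headers={
                # A plausible desktop browser User-Agent; real fingerprint
                # optimization (TLS, header order, JS challenges) goes well
                # beyond what this sketch shows.
                "User-Agent": (
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/124.0.0.0 Safari/537.36"
                ),
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text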

quietsegfault 2 hours ago | parent [-]

... but what about subscription-only, paywalled sources?

tonymet an hour ago | parent [-]

Many publishers offer "first one's free".

For those that don't, I would guess archive.today is using malware to piggyback off of subscriptions.

layer8 2 hours ago | parent | prev [-]

It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.

xurukefi 2 hours ago | parent [-]

But it is reliable in the sense that if it works for a site at all, it almost never fails afterwards.