pards 7 hours ago
>> But because it can also be used to bypass paywalls
> How? Does the site pay for subscription for every newspaper?
Someone with a subscription logs into the site, then archives it. Archive.is uses the current user's session and can therefore see the paywalled content.
mojosam 6 hours ago
> Someone with a subscription logs into the site, then archives it.
That's not the case. I don't have a NYT subscription. I just Googled for an old, obscure article from 1989 on pork bellies that I thought archive.today would be unlikely to have cached. Sure enough, when I asked to retrieve that article, it didn't have it and began the caching process. A few minutes later, it came up with the webpage; if you visit it on archive.is, you can see it was first cached just a few minutes ago.
https://www.nytimes.com/1989/11/01/business/futures-options-...
My assumption has been that the NYT is letting them around the paywall, much like the unrelated Wayback Machine. How else could this be working? The only ways I can think of are that either they have access to a NYT account and are caching with it (something I suspect the NYT would notice and shut down), or there is a documented hole in the paywall that they are exploiting (but not via the Wayback Machine, since the caching process shows they are pulling directly from the NYT).
codedokode 7 hours ago
Do they have such an option? I don't see it on the site, and the browser extension seems to send only the URL [1] to the server. Can you provide more information?
[1] https://github.com/JNavas2/Archive-Page/blob/main/Firefox/ba...
madeforhnyo 5 hours ago
I believe news sites let crawlers access the full articles for a short period of time, so that they appear in search results. Archive.is crawls during that short window.
rkagerer 7 hours ago
Does it still leak your IP, e.g. if the page rendered by the site you're archiving includes it? You'd think they'd create a simple filter to redact that out.