edg5000 | 10 hours ago
Residential proxies are the only way to crawl and scrape. It's ironic for this article to come from the biggest scraping company that ever existed! If you crawl at 1 Hz per crawled IP, no reasonable server would suffer from it. It's the few bad apples (impatient people who don't rate limit) who ruin the internet for users and hosters alike. And then there's Google.
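For what it's worth, the 1 Hz discipline is trivial to implement. A minimal sketch of per-host throttling (stdlib only; polite_fetch is a hypothetical helper, not anything from the article):

    import time
    import urllib.request
    from urllib.parse import urlparse

    def polite_fetch(urls, min_interval=1.0):
        """Fetch URLs, sleeping so consecutive requests to the same
        host are at least `min_interval` seconds apart (~1 Hz)."""
        last_hit = {}  # host -> monotonic timestamp of previous request
        for url in urls:
            host = urlparse(url).netloc
            wait = min_interval - (time.monotonic() - last_hit.get(host, float("-inf")))
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.monotonic()
            with urllib.request.urlopen(url, timeout=10) as resp:
                yield url, resp.read()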
mrweasel | 2 hours ago
First off: Google has never once crashed one of our sites with GoogleBot. They have never tried to bypass our caching, and they are open and honest about their IP ranges, allowing us to rate limit if needed. Residential proxies are not needed if you behave. My take is that you want to scrape stuff that site owners do not want to give you, and you don't want to be told no or perhaps have to pay for a license. That is the only case where I can see you needing residential proxies.
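Google does publish its crawler ranges in machine-readable form, which makes the verification and rate-limiting described here straightforward. A sketch, assuming the JSON list Google hosts at the URL below (current at the time of writing; treat the URL and file layout as assumptions):

    import ipaddress
    import json
    import urllib.request

    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        # The file contains {"prefixes": [{"ipv4Prefix": ...} or {"ipv6Prefix": ...}, ...]}
        with urllib.request.urlopen(RANGES_URL, timeout=10) as resp:
            data = json.load(resp)
        return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
                for p in data["prefixes"]]

    def is_googlebot(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    nets = load_googlebot_networks()
    print(is_googlebot("66.249.66.1", nets))  # inside a long-standing Googlebot range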
Ronsenshi | 8 hours ago
One thing about Google: many anti-scraping services explicitly allow access to Google and maybe a couple of other search engines. Everybody else gets to enjoy the Cloudflare captcha, even when crawling at reasonable speeds. Rules for thee, but not for me.
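The gatekeeping described here amounts to an allowlist check that never looks at request rate. A toy sketch (decide, SEARCH_ENGINE_BOTS, and the verification flag are all illustrative, not any vendor's actual logic):

    # Verified search-engine bots pass; everyone else gets a challenge,
    # regardless of how politely they crawl.
    SEARCH_ENGINE_BOTS = {"googlebot", "bingbot"}  # illustrative list

    def decide(user_agent: str, ip_is_verified_bot: bool) -> str:
        ua = user_agent.lower()
        if ip_is_verified_bot and any(bot in ua for bot in SEARCH_ENGINE_BOTS):
            return "allow"
        return "challenge"  # captcha for everyone else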
toofy | 1 hour ago
Do we think a scraper should be allowed to take whatever means necessary to scrape a site when that site explicitly denies it access? If someone is abusing my site and I block them in an attempt to stop that abuse, are they correct to tell me it doesn't matter what I think and that they can use any method they like to keep abusing it? That seems wrong to me.
BatteryMountain | 9 hours ago
Saying the quiet part out loud... Shhh.
megous | 3 hours ago
I'd still like the ability to just block a crawler by its IP range, but these days, nope. 1 Hz is 86,400 hits per day, or roughly 600k hits per week, and that's just one crawler.

Just checked my access log: 958k hits in a week, from 622k unique addresses. 95% of it is fetching random links from the u-boot repository that I host. I blocked all of the GCP/AWS/Alibaba and of course Azure cloud IP ranges, so now it's almost all coming from "residential" and "mobile" IP address space in completely random places all around the world. Every request comes from a new IP address, and the available IP space of the crawler(s) runs to millions of addresses.

I'm pretty sure my u-boot fork is not that popular. :-D I don't host a popular repo. I host a bot attraction.
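For reference, the kind of tally quoted above falls out of a few lines over a standard access log. A sketch assuming nginx/Apache combined format with the client IP as the first field (the log path is an assumption):

    from collections import Counter

    hits = 0
    ips = Counter()
    with open("access.log") as log:
        for line in log:
            ips[line.split(" ", 1)[0]] += 1
            hits += 1

    print(f"{hits} hits from {len(ips)} unique addresses")
    print("top clients:", ips.most_common(5))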