terminalshort 2 days ago
How real is this "crawler plague" that the author refers to? I haven't seen it. But that's just as likely to be because I don't care, and therefore am not looking, as it is to be because it's not there. Loading static pages from a CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
ApeWithCompiler 2 days ago
The following is the best I could collect quickly to back up the statement. Unfortunately, it's not the high-quality, first-hand raw statistics I would have liked. But from what I have read from time to time, the crawlers acted orders of magnitude outside of what could have been accepted as merely badly configured.
https://herman.bearblog.dev/the-great-scrape/
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
https://lwn.net/Articles/1008897/
https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...
hombre_fatal 2 days ago
My forum traffic went up 10x due to bots a few months ago. Never seen anything like it.
> Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
Why did you bring up static pages served by a CDN, the absolute best-case scenario, as your reference for how crawler spam might affect server performance?
n3storm 2 days ago
My estimate is that at least 70% of traffic on small sites (300-3000 daily views) is not human.
snowwrestler 2 days ago
Yes, it’s true. Most sites don’t have a forever cache TTL, so a crawler that hits every page on a database-backed site is going to hit mostly uncached pages (and therefore the DB). I also have a faceted search that some stupid crawler has spent the last month iterating through. Also mostly uncached URLs.
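To put a rough number on the faceted-search point, here is a minimal sketch in Python. The facet names and values are hypothetical, chosen only for illustration, and are not taken from any site mentioned in this thread. It counts how many distinct URLs a crawler can reach when every facet can be either unset or set to one of its values.

```python
from itertools import product
from math import prod

# Hypothetical facets for illustration only; not from any real site.
FACETS = {
    "category": ["books", "music", "games"],
    "price": ["low", "mid", "high"],
    "rating": ["1", "2", "3", "4", "5"],
    "sort": ["newest", "oldest"],
}


def count_facet_urls(facets):
    """Count distinct filter combinations; each facet may also be unset."""
    return prod(len(values) + 1 for values in facets.values())


def facet_urls(facets, base="/search"):
    """Yield every URL reachable by combining facet values."""
    names = sorted(facets)
    # None stands for "this facet is not applied".
    choices = [[None] + list(facets[name]) for name in names]
    for combo in product(*choices):
        query = "&".join(
            f"{name}={value}"
            for name, value in zip(names, combo)
            if value is not None
        )
        yield f"{base}?{query}" if query else base


if __name__ == "__main__":
    print(count_facet_urls(FACETS))
```

With these four small facets the count is already 288 distinct URLs, and each one is a separate cache key. Adding pagination or a few more facets multiplies that further, which is consistent with a crawler spending weeks on mostly uncached pages.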
zzzeek 2 days ago
I just had to purchase a Cloudflare account to protect two of my sites used for CI that run Jenkins and Gerrit servers. These are resource-hungry Java VMs which I have running on a minimally powered server, as they are intended to be accessed only by a few people, yet crawlers located in Eastern Europe and Asia eventually found them and would regularly drive my CPU up to 500% and make the server unavailable. (It should go without saying that I have always had a robots.txt on these sites that prohibits all crawling. Such files are a quaint relic of a simpler time.)
For a couple of years I'd block the various offending IPs, but this past month the crawling resumed, this time deliberately spread across hundreds of IP addresses so that I could not easily block them. Cloudflare was able to show me within minutes that all of the IP addresses came from a single ASN owned by a very large and well-known Chinese company, and I blocked the entire ASN.
While I could figure out these ASNs manually and get blocklists to add to the Apache config, Cloudflare makes it super easy, showing you the whole thing happening in real time. You can even tailor the 403 response to send them a custom message, in my case, "ALL of the data you are crawling is on github! Get off these servers and go get it there!" (Again, sure, I could write out httpd config for all of that, but who wants to bother.) They are definitely providing a really critical service.
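For anyone who does want to go the manual route, a rough sketch of the idea in Python follows. It is not what Cloudflare does and it does not look up ASNs, since that needs external routing data. As a stand-in, it groups offending IPv4 addresses into /24 networks, keeps the networks with repeated hits, and prints Apache 2.4 `Require not ip` lines. The function names, the /24 prefix, and the hit threshold are all assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def busy_networks(ips, prefix_len=24, min_hits=3):
    """Group IPv4 addresses by network and return the networks seen often.

    Assumption: a fixed prefix length is a crude stand-in for a real
    ASN lookup. IPv4 only; do not mix in IPv6 addresses.
    """
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips
    )
    hot = [net for net, hits in counts.items() if hits >= min_hits]
    # Merge adjacent networks into larger blocks where possible.
    return list(ipaddress.collapse_addresses(hot))


def apache_deny_lines(networks):
    """Format networks as Apache 2.4 authorization directives."""
    return [f"Require not ip {net}" for net in networks]


if __name__ == "__main__":
    # Documentation-range addresses, not real offenders.
    offenders = ["203.0.113.5", "203.0.113.9", "203.0.113.77", "198.51.100.1"]
    for line in apache_deny_lines(busy_networks(offenders)):
        print(line)
```

In Apache 2.4 the generated lines belong inside a `<RequireAll>` block next to a positive rule such as `Require all granted`, because a negated `Require` cannot authorize a request on its own.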
danaris 2 days ago
It's very real. It's crashed my site a number of times.