tommek4077 · 4 hours ago
How do they get overloaded? Is the website too slow? I have quite a big wiki online and barely see any impact from bots.
stinky613 · 4 hours ago
A year or two ago I personally encountered scraping bots that would fetch every possible resultant page from a given starting point. So if one scraped a search results page, it would also scrape every single distinct combination of facets on that search, including nonsensical combinations (e.g. products matching the filter "weight < 2 lbs AND weight > 2 lbs"). We ended up having to block entire ASNs and several subnets (many from Facebook IPs, interestingly).
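The combinatorial blow-up described above is easy to see: a naive crawler that follows every facet link effectively enumerates the power set of the available filters. A minimal sketch (the facet names are made up for illustration):

```python
from itertools import combinations

# Hypothetical facet links a scraper might find on a search results page.
facets = ["color=red", "color=blue", "weight_lt=2lbs", "weight_gt=2lbs", "in_stock=1"]

# A crawler that follows every facet link generates one URL per subset of
# facets -- including contradictory ones like weight_lt=2lbs AND weight_gt=2lbs.
urls = []
for r in range(len(facets) + 1):
    for combo in combinations(facets, r):
        urls.append("/search?" + "&".join(combo))

print(len(urls))  # 2^5 = 32 distinct crawlable URLs from just 5 facet links
```

With 20 facet values instead of 5, the same site exposes over a million distinct "pages" to a crawler that doesn't deduplicate.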
switz · 4 hours ago
If you have a lot of pages, AI bots will scrape every single one on a loop. Wikis generally don't have anywhere near as many pages as a site whose page count tracks an auto-incrementing primary key. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance, and they're basically just scraping garbage (statistics pages for historical matches, or user pages with essentially no content).

Many of them don't even self-identify, scraping instead behind shrouded user agents or via bot farms. I've had to block entire ASNs just to tone it down. That also hurts good-faith actors who genuinely want to build on top of our APIs, because I have to block some cloud providers.

I would guess I'm getting anywhere from 10-25 AI bot requests (maybe more) per real user request, and at scale that adds up. I route bot traffic to separate pods just so it doesn't hinder my real users' experience [0]. Keep in mind that they're hitting deeply cold links, so caching doesn't do much here.

[0] This was more of a fun experiment than anything strictly necessary, but it's proven useful in ways I didn't anticipate.
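The routing-to-separate-pods approach can be sketched very simply. The ASN numbers, user-agent hints, and pool names below are purely illustrative, and a real deployment would typically do this classification at the load balancer or CDN edge rather than in application code:

```python
# Hypothetical sketch: classify each request into a backend pool, assuming
# client IPs have already been mapped to ASNs (e.g. via an offline ASN database).
BLOCKED_ASNS = {64496, 64511}    # illustrative ASNs chosen for an outright block
BOT_UA_HINTS = ("bot", "crawler", "spider", "scrapy")

def route(request_asn: int, user_agent: str) -> str:
    """Return the backend pool that should serve this request."""
    if request_asn in BLOCKED_ASNS:
        return "blocked"
    if any(hint in user_agent.lower() for hint in BOT_UA_HINTS):
        # Isolated pods: bot load can't degrade latency for real users.
        return "bot-pods"
    return "user-pods"

print(route(64496, "Mozilla/5.0"))          # blocked
print(route(13335, "ExampleBot/1.0"))       # bot-pods
print(route(13335, "Mozilla/5.0 Firefox"))  # user-pods
```

The useful property is that classification is best-effort: misclassified bots still get served, just from a pool whose saturation doesn't affect real users.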
blell · 4 hours ago
In these discussions no one will admit this, but the answer is generally yes: websites written in Python and the like.
roblh · 4 hours ago
There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, a couple million indexed pages adds up fast, especially if you weren't fortunate enough to inherit a project built on a framework that makes server-side rendering easy.
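The caching point is worth a back-of-envelope check. If crawlers spread requests roughly uniformly across a multi-million-page site, the expected cache hit rate is just the fraction of pages the cache can keep warm (the numbers below are illustrative assumptions, not measurements):

```python
# Illustrative numbers: why a CDN helps less when bots crawl the long tail.
total_pages = 2_000_000   # pages exposed on the site (assumption)
cached_pages = 50_000     # pages the CDN keeps warm at any time (assumption)

# With requests spread evenly over all pages, the expected hit rate is
# simply the cached fraction of the corpus.
hit_rate = cached_pages / total_pages
print(f"{hit_rate:.1%}")  # 2.5% -- nearly every bot request reaches origin
```

Real users cluster on popular pages, so their hit rate is far higher; this is why bot traffic disproportionately loads the origin.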
kpcyrd · 4 hours ago
The API seems to be written in Perl: https://github.com/metabrainz/musicbrainz-server
| ||||||||||||||||||||||||||||||||