n1xis10t 18 hours ago
Interesting. When it was just normal search engines I didn't hear of people having this problem, so either there are a bunch of people pretending to be Bing, Google, and Yandex, or those companies have gotten a lot more aggressive.
bobbiechen 16 hours ago
There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot and the other legitimate crawlers, and most people don't check the reverse DNS / published IP lists - it's tedious to do even for well-behaved crawlers that document how to identify themselves. So much for the User-Agent header.
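(Not from the thread, but for illustration: the check Google documents is forward-confirmed reverse DNS. A minimal Python sketch, assuming you already have the client IP from your access logs; the function name is mine:)

    import socket

    def verify_googlebot(ip: str) -> bool:
        """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot."""
        try:
            # Reverse (PTR) lookup of the client IP.
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        # Real Googlebot hostnames end in googlebot.com or google.com.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            # Forward (A) lookup of that hostname must map back to the same IP.
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        return ip in addresses

(The same idea works for Bingbot via search.msn.com and other crawlers that publish a verification scheme; the tedium the parent mentions is doing this per crawler and caching the results so you aren't paying a DNS round trip on every request.)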
reallyhuh 16 hours ago
What are the proportions for those attributions? Are they roughly evenly distributed, or lopsided toward one of the three?
giantrobot 15 hours ago
Normal search engine spiders did, and do, cause problems, but not on the scale of AI scrapers. Search engine spiders tend to follow robots.txt, look at the sitemap.xml, and generally try to throttle their requests. You'll find some that are poorly behaved, but those tend to get blocked and either die out or get fixed and behave better.

The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are so badly written and managed that the same scraper will hit the same site multiple times a day, or even multiple times an hour. They also pay no attention to context, so they'll happily blast git repo hosts and hit expensive endpoints. They're like a constant DoS attack, and they're hard to block at the network level because they span different hyperscalers' IP blocks.
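(Not something the comment spells out, but the throttling that well-behaved spiders do, and that sites end up enforcing themselves against the bad ones, is roughly a per-client token bucket. A rough Python sketch; the rate/burst numbers and keying on client IP are assumptions, and as the comment notes, IP keying breaks down once a scraper fans out across hyperscaler address blocks:)

    import time
    from collections import defaultdict

    class TokenBucket:
        """Per-client token bucket: roughly `rate` requests/second, bursts up to `burst`."""

        def __init__(self, rate: float = 1.0, burst: float = 10.0):
            self.rate = rate
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)    # tokens remaining per client key
            self.last = defaultdict(time.monotonic)     # last refill time per client key

        def allow(self, key: str) -> bool:
            now = time.monotonic()
            elapsed = now - self.last[key]
            self.last[key] = now
            # Refill in proportion to elapsed time, capped at the burst size.
            self.tokens[key] = min(self.burst, self.tokens[key] + elapsed * self.rate)
            if self.tokens[key] >= 1.0:
                self.tokens[key] -= 1.0
                return True
            return False    # caller would reject the request (e.g. HTTP 429)

(Usage would be something like tb.allow(client_ip) per request; in practice you'd also evict idle keys so the dicts don't grow without bound.)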