hectormalot 7 hours ago
Maybe I’m naive about this, but I didn’t expect AI scrapers to be that big of a load. I mean, it’s not as if they need to scrape the same content at 1000+ QPS, and even then I wouldn’t expect them to download all the media and images either. What am I missing that explains the gap between this and a “constant DDoS” of the site?
thresh 6 hours ago
You can't really cache the dynamic content produced by forges like GitLab or web forums like phpBB, so every request goes through the slow path. Media and JS are of course cached on the edge, so they aren't an issue.

Even when the volume of AI requests isn't that high (generally hundreds per second at most across all of our services combined), that's still a load that causes issues for legitimate users and developers. We've watched it grow from somewhat reasonable to pretty much 99% of the responses we serve.

Can it be solved by throwing more hardware at the problem? Sure. But that isn't sustainable, and the reasonable approach in our case is to filter off the parasitic traffic.
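To make the caching split concrete: static assets can live in an edge cache, while dynamic forge pages are unique per request and always reach the origin. A minimal sketch in Python; the path prefixes and TTL are hypothetical illustrations, not any forge's actual routing.

    # Hypothetical path prefixes and TTL; real forges route differently.
    STATIC_PREFIXES = ("/assets/", "/static/", "/avatars/")

    def edge_cache_ttl(path: str) -> int:
        """Seconds a CDN/edge cache may hold the response for `path`."""
        if path.startswith(STATIC_PREFIXES):
            return 86400  # media/JS: served from the edge, origin never sees it
        # Dynamic forge pages (diffs, blame, log, forum threads) vary per
        # commit, user, and query string, so there is nothing reusable to
        # cache: every scraper request becomes an origin hit on the slow path.
        return 0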
nijave 7 hours ago
I think there are a few things at play here:

- AI scrapers pull docs from many sites in parallel, so instead of a human picking a single Google result, one query hits a bunch of sites at once.
- AI will crawl a site looking for the correct answer, which may touch a handful of pages.
- AI sends requests in quick succession: big bursts instead of a small trickle spread over a longer time (see the rate-limiting sketch after this comment).
- Personal assistants may crawl the site repeatedly, scraping everything. We saw a fair bit of this at work; they announced themselves with user agents.
- At work (a B2B SaaS webapp) we also found that the personal-assistant variety tended to hammer really computationally expensive data-export and reporting endpoints, generally without filters. While our app technically supported that, it was very inorganic traffic.

That said, I don't think the solution is blanket blocks. Really, it's exposing how poorly optimized sites are for an emerging technology.
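To illustrate the burst point above: a per-client token bucket lets a small steady trickle through while rejecting bursts. This is a generic sketch, not anyone's production setup; the rates and the idea of keying on an IP or user agent are assumptions.

    import time
    from collections import defaultdict

    class TokenBucket:
        """Allow `rate` requests/second, with bursts up to `burst`."""
        def __init__(self, rate: float = 2.0, burst: float = 10.0):
            self.rate, self.burst = rate, burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # One bucket per client key (an IP or user-agent hash, for example).
    buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

    def should_serve(client_key: str) -> bool:
        # A polite trickle passes; a burst of parallel fetches gets 429s.
        return buckets[client_key].allow()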
Y-bar 7 hours ago
They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don’t respect robots.txt. Many of them also end up crawling our noindex,nofollow search pages, leading to cost overruns on our Algolia usage. We spend far more time adjusting WAF rules and other bot controls than we should.
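For contrast, honoring robots.txt costs a crawler almost nothing. A minimal sketch using Python's standard library; the host and user-agent string are placeholders:

    from urllib import robotparser

    # A real crawler would cache the parsed robots.txt per host
    # rather than refetching it for every URL.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/search?q=widgets"
    if rp.can_fetch("ExampleBot/1.0", url):
        print("allowed: fetch it, respecting any Crawl-delay")
        print(rp.crawl_delay("ExampleBot/1.0"))  # None if unspecified
    else:
        print("disallowed: a compliant bot skips this URL entirely")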
eipi10_hn 3 hours ago
Yes, it's that BIG of a load: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/