simonw 5 hours ago

> These bots are almost certainly scraping data for AI training; normal bad actors don't have funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. We can't tell, but we can guess since there aren't all that many large AI corporations out there.

Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?

I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?

marginalia_nu 4 hours ago | parent | next [-]

I agree it's more than a bit handwavy. The consensus seems to be that AI companies are driving this, but it's really hard to conclusively prove who or what is behind the attacks.

Weird part #1 is that the traffic mostly isn't shaped like crawler traffic. It's incredibly bursty and heavily redundant, missing even the most obvious low-hanging-fruit optimizations.

Could be someone is using residential proxies to wrap AI agents' web traffic, but even so, there are a lot of pieces that don't really make sense, like why the traffic pattern looks like being hit by a shotgun. It isn't just one request, but anywhere between 40 and 100 redundant requests.

A popular theory is that this is sloppy coding and the AI companies are too rich to care, but that doesn't really add up either. This isn't just a minor inefficiency: if it's "just" bad coding, they stand to gain monumental efficiency improvements by fixing the issues, in the sense of getting the data much faster, which would be a clear competitive edge.

Really weird.

My unsubstantiated guess is that the residential proxy/botnet is very unreliable, and that's why they fire so many requests. That would make sense if it's sold as a service.
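The unreliable-proxy guess above can be sketched as a fan-out where many redundant requests go out and the first success wins. Everything here is hypothetical (the proxy fetcher is a stub simulating a ~20% success rate, not any real scraper's code), but it shows why an operator who can't trust individual proxies would rationally fire 40+ copies of the same request:

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_via_proxy(url, proxy):
    """Stand-in for an HTTP fetch routed through a flaky residential
    proxy; assume each attempt succeeds only ~20% of the time."""
    if random.random() < 0.2:
        return f"body-of-{url}"
    raise ConnectionError(f"proxy {proxy} failed")

def shotgun_fetch(url, proxies, fanout=40, fetch=fetch_via_proxy):
    """Fire `fanout` redundant requests and return the first body that
    comes back. At a 20% per-attempt success rate, all 40 attempts fail
    only with probability 0.8**40, roughly 1 in 7500."""
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        futures = [pool.submit(fetch, url, proxies[i % len(proxies)])
                   for i in range(fanout)]
        body = None
        for fut in as_completed(futures):
            try:
                body = fut.result()
                break
            except ConnectionError:
                continue
        return body
```

From the server's side, that one logical fetch looks like a 40-request burst, which matches the shotgun pattern described above.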

gamesieve 3 hours ago | parent | next [-]

I suspect the redundant requests are primarily designed to weed out poisoned data served on otherwise valid URLs. I've also seen the redundant requests increase massively the more sources I block at the firewall level, so it feels like they're pre-emptively overcompensating for some percentage of requests being blocked.
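If the goal really is filtering out poisoned responses, the redundancy would amount to a majority vote over repeated fetches of the same URL. A toy sketch of that idea (a guess at the mechanism, not anything observed in a real scraper):

```python
from collections import Counter
from hashlib import sha256

def majority_content(responses):
    """Given several redundant fetches of the same URL, keep the body
    the majority of fetches agree on, discarding poisoned or tarpit
    variants that were served to only a minority of requests."""
    counts = Counter(sha256(body.encode()).hexdigest() for body in responses)
    winner_hash, _ = counts.most_common(1)[0]
    for body in responses:
        if sha256(body.encode()).hexdigest() == winner_hash:
            return body
```

Under this theory, a site that poisons, say, a third of responses forces the scraper to fetch each page several times, which would also explain the redundancy scaling up as more of their sources get blocked.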

My website contains ~6000 unique data points in effectively infinite combinations on effectively infinite pages. Some of those combinations are useful for humans, but the AI scrapers could gain a near-infinite efficiency improvement just by identifying as a bot and heeding my robots.txt and/or rel="nofollow" hints to access the ~500 top-level pages that contain close to everything unique. They just don't care. All their efficiency attempts are directed solely toward bypassing blocks. (Today I saw them varying the numbers in their user agent strings: X15 rather than X11, Chrome/532 rather than Chrome/132, and so on...)

oasisbob 3 hours ago | parent | prev [-]

> A popular theory is that this is because of sloppy coding, AI companies are too rich to care, but then again that doesn't really add up

I can substantiate this a bit. Verified traffic from Amazonbot is too dumb to do anything with 429s. They will happily slam your site with more traffic than you can handle, and will completely ignore the fact that over half the responses are useless rate limits.

They say they honor REP (the Robots Exclusion Protocol), but Amazonbot will still hit you pretty persistently even with a full disallow directive in robots.txt.
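Respecting 429s takes only a few lines of backoff logic, which makes ignoring them harder to excuse. A generic sketch (not Amazonbot's code; the header handling here covers only the delay-in-seconds form of Retry-After):

```python
def backoff_delay(status, headers, attempt, base=1.0, cap=300.0):
    """Seconds a well-behaved crawler should wait before retrying.
    Honors a numeric Retry-After on 429/503 responses, otherwise
    falls back to capped exponential backoff."""
    if status in (429, 503):
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            return min(float(retry_after), cap)
        return min(base * 2 ** attempt, cap)
    return 0.0
```

A crawler wired to this would stop hammering a site as soon as half its responses came back as rate limits, instead of burning both sides' bandwidth on useless 429s.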

marginalia_nu 3 hours ago | parent [-]

How do you know it's Amazonbot?

oasisbob 3 hours ago | parent [-]

User Agent, SWIPed IP space, and the PTR records resolving to an Amazon-controlled crawl zone.
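That combination amounts to forward-confirmed reverse DNS: the PTR record must land in the bot's documented zone, and the hostname must resolve back to the same IP. A sketch with the resolver calls injectable for testing; the `.crawl.amazonbot.amazon` suffix matches Amazon's published guidance for verifying Amazonbot, but treat the exact zone as an assumption:

```python
import socket

def is_verified_amazonbot(ip,
                          reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                          forward=lambda host: socket.gethostbyname(host)):
    """Forward-confirmed reverse DNS check. The PTR for `ip` must end
    in Amazon's crawl zone (assumed suffix) AND resolve back to the
    same IP, so spoofing the User-Agent alone isn't enough."""
    try:
        host = reverse(ip)
    except socket.herror:
        return False
    if not host.endswith(".crawl.amazonbot.amazon"):
        return False
    try:
        return forward(host) == ip
    except socket.gaierror:
        return False
```

Either DNS leg failing, or a mismatch between the two, means the traffic merely claims to be Amazonbot.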

oasisbob 4 hours ago | parent | prev | next [-]

I want more data too.

The root sources of the traffic from residential proxies get murky very quickly.

It's easy to follow the chain partway for some traffic, e.g. "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive User-Agent" ... but it still leaves the obvious question of who the Firecrawl client is.

Residential proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.

ghywertelling 5 hours ago | parent | prev [-]

https://parallel.ai/

I bet a lot of companies want to provide search results to AI agents.