Symbiote an hour ago

My employer's site was recording 1,500 requests per second from a single AI bot earlier this week. At the time I looked, the requests came from 2.4 million different IPs, with 1-2 requests from each IP, most likely all for unique URLs. That single bot was 55% of our traffic. This kind of crawling pushes us to (and sometimes beyond) the limit of our capacity.
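
For anyone wanting to check their own logs for the same shape of traffic, here is a rough sketch that counts how many IPs made 1, 2, 3... requests. It assumes the client IP is the first field of each log line (as in the common/combined log format); the function name and the log path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Histogram of requests per client IP, read from an access log on stdin.
# Assumption: the client IP is the first whitespace-separated field.
# Output: one line per bucket, "<requests per IP> <number of IPs>".
requests_per_ip_histogram() {
    awk '{ hits[$1]++ }
         END {
             for (ip in hits) buckets[hits[ip]]++
             for (n in buckets) print n, buckets[n]
         }' | sort -n
}

# Usage (hypothetical log path):
#   requests_per_ip_histogram < /var/log/nginx/access.log
```

If nearly every IP lands in the "1" or "2" bucket, per-IP rate limits have almost nothing to bite on, which is the situation described above.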

I have also seen thousands of requests per hour from a single IP to a small set of pages, e.g. the homepage. I don't know why; it doesn't matter, so I ignore it.

I've recently found there are websites offering curated "AI ready" datasets, and several of these sites claim to have indexed our site. On the 3-4 I looked at, ours was one of a few hundred datasets. The content is interesting enough to be something an AI company would want, so my conclusion is that the site is being specifically targeted by the AI bot developers.

Bender an hour ago

> the site is being specifically targeted by the AI bot developers

As I was reading your comment it sounded like a targeted attack. I think you are right that it was targeted. I assume you have researched which content could be rate limited by URI target rather than by source IP, with a message telling people the content is temporarily unavailable due to an AI bot attack?

Is the concern that your site is being DDoS'd or that they are reselling your copyrighted material? If reselling, I would get corporate lawyers involved and seek damages, though I am not a lawyer. Feds could subpoena some of the providers for the identity of the attackers.

If the concern is DDoS, has your team done any analysis of the clients to see what they have in common? Based on the number of IP addresses you are talking about, I assume it must be from wireless carriers. Have you looked at TCP SYN TTL and other characteristics? If there is anything in common, those connections could be routed internally to another listener that has tighter rate limits. That would mean that perhaps cellular users find some content unavailable until the bots go away, or they randomly get one of a dozen different captchas or random JavaScript puzzles to access each document until the storm subsides. The puzzles could probably be regenerated hourly by AI to keep the attackers on their toes.

Another option would be to require an account to access the documents and limit the number of documents each account can download per hour, day, and/or week, then add more friction to account creation, or limit account creation to the address space of countries you do business with after blocking most proxies and VPN providers.
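
On the TCP SYN TTL question, a rough sketch of how one might tally TTLs from a capture. The TTL seen at the server is the sender's initial TTL (commonly 64, 128 or 255) minus the hop count, so a tight cluster hints at similar operating systems and network paths. The interface name, port and file name in the comments are assumptions, and `ttl_tally` is just a name picked for this example.

```shell
#!/bin/sh
# Tally IP TTL values found in `tcpdump -v` text output read from stdin.
# Capture step (needs root; eth0, port 443 and syn.txt are assumptions):
#   tcpdump -nn -v -i eth0 -c 10000 \
#     'dst port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' > syn.txt
# Output: "<ttl> <count>", most frequent TTL first.
ttl_tally() {
    sed -n 's/.*ttl \([0-9][0-9]*\).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}

# Usage:
#   ttl_tally < syn.txt
```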

Another option to limit the blast zone of an attack is to block countries that one does not do business in, but that depends on your business model.

CDNs like Cloudflare are not doing anything magic. If they can block the bots, so can just about anyone else. Without seeing samples of the attacks I could not make many more suggestions.

Symbiote 44 minutes ago

It's not copyrighted data (academia / library stuff), so the concern is the DDoS. If researchers write to us, we send them a database dump or export, but only one AI company has ever written.

So far there's always been some pattern to allow a block/challenge, e.g. user agent, JA3 / JA4, ASN. (I haven't looked at TCP SYN TTL before.) Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India) with the rest all over the world, mostly consumer ISPs although I haven't distinguished between fixed line and mobile.

We tried Cloudflare for a couple of months, on a paid plan, which I think blocked many of the non-distributed crawlers, but didn't help much with these distributed ones.

Meanwhile we have been reducing the cost of rendering the pages.

Bender 38 minutes ago

> Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India)

It sounds like there is the option to at least reduce the load by up to 80%. That's at least a start.

This repo [1] is not perfect, but it's a start. I would remove the IPv6 (AAAA) DNS records, wait a day, and then disable IPv6 access to the site so that attackers are forced onto IPv4. Then clone the repo [1] or use one of the GeoIP databases to limit access from specific countries.

That repo also contains known proxies. That may account for another percentage of that remaining 20%.

As the content is academic in nature I don't know how your team feels about blocking Tor, but there is also a list of many (not all) of the last 30 days of Tor exit nodes in that repo.

The blocks would not have to be permanent, just enabled during the storms if your team so desired.

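    # Blackhole every network in the Brazil netset, skipping comment lines
    for BadIP in $(grep -Ev '^#' ipip_country_br.netset); do ip route add blackhole "${BadIP}" 2>/dev/null; done

The netset used above is the list of country IP addresses for Brazil [2].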
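
Since the blocks only need to be on during the storms, a sketch that prints the matching `add` or `del` commands instead of running them may be handy. `blackhole_cmds` is a made-up helper name; it reads the same kind of netset file as the loop above and skips any line that does not look like an IPv4 address or CIDR, so junk in the file never reaches a shell.

```shell
#!/bin/sh
# Print (do not run) one `ip route` command per network in a netset file.
#   $1 = add | del
#   $2 = netset file: one IPv4 address or CIDR per line, '#' for comments
blackhole_cmds() {
    case $1 in
        add|del) ;;
        *) echo "usage: blackhole_cmds add|del FILE" >&2; return 1 ;;
    esac
    # Keep only lines that look like an IPv4 address or CIDR block;
    # this also drops comments and blank lines.
    grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}(/[0-9]{1,2})?$' "$2" |
        while read -r net; do
            printf 'ip route %s blackhole %s\n' "$1" "$net"
        done
}

# Usage, as root, once the printed commands look right:
#   blackhole_cmds add ipip_country_br.netset | sh    # storm starts
#   blackhole_cmds del ipip_country_br.netset | sh    # storm is over
```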

As this is academia I don't know if there is a concept of a service level agreement or a promise of availability, but during attacks the requests to specific URLs could be redirected to a static pre-compressed landing page served out of memory that says "Access to these documents limited during AI bot attacks, here is where to request a full download instead: "
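
A minimal sketch of preparing such a page ahead of time, assuming a web server that can serve a `.gz` file sitting next to the original (nginx's `gzip_static`, for instance). The function name, file name and the contact URL in the usage comment are placeholders, not anything from your setup.

```shell
#!/bin/sh
# Write a small static notice page plus a pre-compressed copy next to it.
#   $1 = output directory
#   $2 = where to request a full download (URL or address, placeholder)
make_notice_page() {
    dir=$1
    contact=$2
    mkdir -p "$dir" || return 1
    cat > "$dir/limited.html" <<EOF
<!doctype html>
<title>Temporarily limited</title>
<p>Access to these documents is limited during AI bot attacks.
To request a full download instead, see: $contact</p>
EOF
    # Compress once, up front, so serving the page costs no CPU per request.
    gzip -9 -c "$dir/limited.html" > "$dir/limited.html.gz"
}

# Usage (hypothetical paths and URL):
#   make_notice_page /var/www/notice https://example.org/bulk-export
```

The page is tiny, so the operating system's file cache will normally keep it in memory without extra configuration.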

I forgot to mention, many of the AI bots are limited to HTTP/1.1. During an attack your server could redirect those requests to a static page as well. I'm curious whether you can tell from your access logs if the majority of the attack is HTTP/1.1 or HTTP/2.0. Most botters are too lazy to implement HTTP/2.0. Real browser clients use HTTP/2.0. Example of using HTTP/2.0 and HTTP/1.1:

    curl -i --http2 \
      -A "Mozilla/5.0 (X11; Linux x86_64; rv:151.0) Gecko/20100101 Firefox/151.0" \
      -H "sec-fetch-mode: navigate" \
      -H "accept-language: en-US,en;q=0.9" \
      --url 'https://blawg.nochan.net/.env'

    curl -i --http1.1 \
      -A "Mozilla/5.0 (X11; Linux x86_64; rv:151.0) Gecko/20100101 Firefox/151.0" \
      -H "sec-fetch-mode: navigate" \
      -H "accept-language: en-US,en;q=0.9" \
      --url 'https://blawg.nochan.net/.env'

Some bots also do not hide that they are bots: they say so in their user-agent header. That would be too easy, so I doubt that is the case here.
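
To answer the HTTP/1.1 vs HTTP/2.0 question from the logs, something like the sketch below may do. It assumes the combined log format, where the request line is the first double-quoted field and the user agent is the third; both function names are made up, and the field positions will need adjusting for a custom log format.

```shell
#!/bin/sh
# Count requests per HTTP protocol version from an access log on stdin.
# Assumption: combined log format, request line is the first quoted
# field, e.g. "GET /path HTTP/1.1".
protocol_tally() {
    awk -F'"' '{
        n = split($2, req, " ")
        proto = (n >= 3) ? req[n] : "malformed"
        count[proto]++
    }
    END { for (p in count) print p, count[p] }' | sort
}

# Count requests whose user agent contains "bot" (case-insensitive).
# With '"' as the separator, the combined format's user agent is field 6.
self_declared_bots() {
    awk -F'"' 'tolower($6) ~ /bot/ { n++ } END { print n + 0 }'
}

# Usage (hypothetical log path):
#   protocol_tally < /var/log/nginx/access.log
#   self_declared_bots < /var/log/nginx/access.log
```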

[1] - https://github.com/firehol/blocklist-ipsets/

[2] - https://github.com/firehol/blocklist-ipsets/blob/master/ipip...