| ▲ | Symbiote 2 hours ago | |
It's not copyrighted data (academia / library stuff), so the concern is the DDoS. If researchers write to us, we send them a database dump or export, but only one AI company has ever written. So far there has always been some pattern that allows a block/challenge, e.g. user agent, JA3 / JA4, ASN. (I haven't looked at TCP SYN TTL before.) Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India) with the rest all over the world, mostly consumer ISPs, although I haven't distinguished between fixed line and mobile. We tried Cloudflare for a couple of months, on a paid plan, which I think blocked many of the non-distributed crawlers but didn't help much with these distributed ones. Meanwhile we have been reducing the cost of rendering the pages. | ||
| ▲ | Bender 2 hours ago | parent [-] | |
> Usually the IPs are 80% or so in one country (e.g. Brazil, US, Vietnam or India)

It sounds like there is the option to reduce the load by up to 80%. That's at least a start. This repo [1] is not perfect, but it's a start. I would remove the IPv6 DNS records, wait a day, and then disable IPv6 access to the site so that attackers are forced onto IPv4. Then clone this repo [1] or use one of the GeoIP databases to limit access from specific countries. That repo also contains known proxies, which may account for another part of the remaining 20%. As the content is academic in nature I don't know how your team feels about blocking Tor, but there is also a list of many (not all) of the last 30 days of Tor exit nodes in that repo. The blocks would not have to be permanent, just enabled during the storms if your team so desired.
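To make the country/proxy blocking idea concrete, here is a rough Python sketch of checking a client IP against a list of CIDR ranges. It assumes the list is plain text with one address or CIDR per line and `#` comments; the function names and the sample ranges (reserved documentation prefixes) are made up for illustration, not taken from the repo.

```python
import ipaddress


def parse_netset(text):
    """Parse one-address-or-CIDR-per-line text, skipping blanks and '#' comments."""
    networks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            # strict=False tolerates entries whose host bits are set
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Hypothetical sample list using reserved documentation prefixes
SAMPLE = """
# example ranges only
192.0.2.0/24
198.51.100.0/25
203.0.113.7
"""

networks = parse_netset(SAMPLE)
print(is_blocked("192.0.2.55", networks))
print(is_blocked("198.51.100.200", networks))
```

A linear scan like this is only good for experimenting or for log analysis. For real traffic you would probably load the ranges into a kernel-level set (ipset / nftables) or your web server's geo module instead.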
Example of country IP addresses for Brazil [2]

As this is academia I don't know if there is a concept of service level agreement or a promise of availability, but during attacks the requests to specific URLs could be redirected to a static pre-compressed landing page served out of memory that says "Access to these documents limited during AI bot attacks, here is where to request a full download instead: "

I forgot to mention, many of the AI bots are limited to HTTP/1.1. During an attack your server could redirect those requests to a static page as well. Curious if you can tell from your access logs whether the majority of the attack is HTTP/1.1 or HTTP/2.0. Most botters are too lazy to implement HTTP/2.0. Real browser clients use HTTP/2.0.

Example of using HTTP/1.1 and HTTP/2.0
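On the access-log question, one quick way to check is to tally the protocol version from the request field. This is a minimal sketch that assumes the common/combined log format, where the request appears as `"GET /path HTTP/1.1"`; the sample lines, paths and user agents are invented, and a custom log format would need a different regex.

```python
import re
from collections import Counter

# Matches the quoted request field, e.g. "GET /record/1 HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ \S+ (HTTP/\d(?:\.\d)?)"')


def tally_protocols(lines):
    """Count requests per HTTP version; lines that do not match go under 'unparsed'."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        counts[match.group(1) if match else "unparsed"] += 1
    return counts


# Invented sample lines in combined log format
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /record/1 HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.11 - - [01/Jan/2025:00:00:02 +0000] "GET /record/2 HTTP/2.0" 200 4096 "-" "Mozilla/5.0"',
    '192.0.2.12 - - [01/Jan/2025:00:00:03 +0000] "GET /record/3 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    "not a log line",
]

counts = tally_protocols(SAMPLE_LOG)
for protocol, n in counts.most_common():
    print(protocol, n)
```

Running this over a window of log lines from during an attack versus a quiet period should show whether HTTP/1.1 is over-represented enough to be worth redirecting.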
Some bots also do not hide that they are bots, in that they say they are a bot in their User-Agent header. That would be too easy, so I doubt that is the case.

[1] - https://github.com/firehol/blocklist-ipsets/

[2] - https://github.com/firehol/blocklist-ipsets/blob/master/ipip... | ||