VladVladikoff 2 hours ago

As a site operator who has been battling an influx of extremely aggressive AI crawlers, I'm now wondering if my tactics have accidentally blocked the Internet Archive. I am totally OK with them scraping my site, and they would likely obey robots.txt, but these days even Facebook ignores it and exceeds my stipulated crawl delay by distributing its traffic across many IPs. (I even have a special nginx rule just for Facebook.)
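(For illustration, a rule like the one mentioned above could look roughly like this. The key trick against IP-distributed crawlers is to rate-limit on a constant key derived from the user agent rather than on the client IP, so every Facebook IP shares one bucket. Zone name, size, and rate below are placeholders, not the commenter's actual values.)

```nginx
# Map facebookexternalhit requests onto a single shared rate-limit key,
# regardless of source IP; everyone else gets an empty key (= not limited).
map $http_user_agent $fb_bucket {
    default               "";
    ~*facebookexternalhit "facebook";
}

# One shared bucket for all Facebook crawler IPs. Values are illustrative.
limit_req_zone $fb_bucket zone=fbcrawl:1m rate=30r/m;

server {
    location / {
        limit_req zone=fbcrawl burst=5 nodelay;
    }
}
```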

Blocking certain JA3 hashes has so far been the most effective countermeasure. However, I wish there were an nginx wrapper around hugin-net that could help me do TCP fingerprinting as well, as I do not know Rust and feel terrified of asking an LLM to write it. There is also a race condition with that approach: because the fingerprinting is passive, even the JA4 hashes won't be available for the first connection, and the AI crawlers I've seen make one request per IP, so you never get a chance to block a second request (it never happens).
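(For readers unfamiliar with JA3: the hash is just the MD5 of a comma-separated string built from five ClientHello fields, with GREASE values filtered out because clients randomize them. A minimal Python sketch, assuming the fields have already been parsed out of the handshake; parsing the TLS record itself is out of scope here.)

```python
import hashlib

# GREASE values (RFC 8701) are excluded before hashing, since clients
# randomize which ones they send and they would break the fingerprint.
GREASE = {0x0a0a, 0x1a1a, 0x2a2a, 0x3a3a, 0x4a4a, 0x5a5a, 0x6a6a, 0x7a7a,
          0x8a8a, 0x9a9a, 0xaaaa, 0xbaba, 0xcaca, 0xdada, 0xeaea, 0xfafa}

def ja3(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 hash from parsed ClientHello fields.

    Fields are decimal values; lists are dash-joined, sections comma-joined,
    following the usual JA3 convention.
    """
    def join(vals):
        return "-".join(str(v) for v in vals if v not in GREASE)

    s = ",".join([str(version), join(ciphers), join(extensions),
                  join(curves), join(point_formats)])
    return s, hashlib.md5(s.encode()).hexdigest()
```

A blocklist is then just a set of these 32-character hex digests checked per handshake, which is why it only helps once you have already seen the fingerprint.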

danrl an hour ago | parent | next [-]

> they would likely obey robots.txt

If only... Despite providing a useful service, they are not as nice towards site owners as one would hope.

Internet Archive says:

> We see the future of web archiving relying less on robots.txt file declarations geared toward search engines

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

They are not alone in that. Archive Team, a different organization not to be confused with archive.org, also doesn't respect robots.txt, according to their wiki: https://wiki.archiveteam.org/index.php?title=Robots.txt

I think it is safe to say that there is little consideration for site owners from the largest archiving organizations today. Whether there should be is a different debate.

mycall 2 hours ago | parent | prev | next [-]

Evasion techniques like JA3 randomization or impersonation can bypass detection.

andrepd 2 hours ago | parent | prev [-]

I wonder if it would be practical to have bot-blocking measures that can be bypassed with a signature from a set of whitelisted keys... In this case the server would be happy to allow Internet Archive crawlers.

freedomben 2 hours ago | parent [-]

That's an interesting idea. mTLS could probably be used for this pretty easily. It would require IA to support it, of course, but it could be a nice solution. I wonder, do they already support it? I might throw up a test...
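(A sketch of how that could look in nginx, assuming a hypothetical CA certificate that a cooperating crawler like the Internet Archive would publish; they do not offer one today. `ssl_verify_client optional` keeps the site working for ordinary visitors with no certificate, while `$ssl_client_verify` flags requests that present a cert chaining to the allowlisted CA.)

```nginx
server {
    listen 443 ssl;
    ssl_certificate         /etc/nginx/tls/site.pem;
    ssl_certificate_key     /etc/nginx/tls/site.key;

    # Hypothetical CA published by the trusted crawler.
    ssl_client_certificate  /etc/nginx/tls/trusted-crawler-ca.pem;
    # "optional": clients without a cert are still served normally.
    ssl_verify_client       optional;
}

# $ssl_client_verify is "SUCCESS" only when the presented cert verifies
# against the CA above; use $trusted_crawler to skip rate limits,
# fingerprint blocks, or challenge pages for those requests.
map $ssl_client_verify $trusted_crawler {
    default 0;
    SUCCESS 1;
}
```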