| ▲ | Messing with Scraper Bots(herman.bearblog.dev) |
| 107 points by HermanMartinus 8 hours ago | 36 comments |
| |
|
| ▲ | simondotau 3 hours ago | parent | next [-] |
The more things change, the more they stay the same. About 10-15 years ago, the scourge I was fighting was social media monitoring services: companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity. Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests. The scraping stopped within two days and never came back. -- [0] Random but deterministic based on post ID, so the injected text stayed consistent. [1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites. [2] This was back around 2009 or so, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent or static strings that didn't match what a real browser would actually send. |
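A minimal sketch in Python of the two tricks described above (not the author's actual code); the brand names, injection rate, and header checks are made-up placeholders, and the filter would only be applied to requests already re-identified as the bot:
import random

BRANDS = ["Acme Cola", "Globex Motors", "Initech Mobile"]  # ~100 brands in practice

def inject_brands(post_id: int, text: str, rate: float = 0.1) -> str:
    """Sprinkle brand names into a post, seeded by post ID so the
    injected text is identical on every request for the same post."""
    rng = random.Random(post_id)  # deterministic per post
    out = []
    for word in text.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(BRANDS))
    return " ".join(out)

def headers_look_spoofed(headers: dict) -> bool:
    """Crude header-consistency check: a browser User-Agent that arrives
    without the headers real browsers always send is suspicious."""
    claims_browser = "Mozilla/" in headers.get("User-Agent", "")
    missing = not headers.get("Accept-Language") or not headers.get("Accept-Encoding")
    return claims_browser and missing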
|
| ▲ | VladVladikoff 3 hours ago | parent | prev | next [-] |
This is a fundamental misunderstanding of what those bots are requesting. They aren’t parsing those PHP files; they are using their existence for fingerprinting — they are trying to determine the presence of known vulnerabilities. They probably stop reading immediately after receiving the HTTP response code and discard the remainder of the response. |
|
| ▲ | Kiro 3 hours ago | parent | prev | next [-] |
| I remember when you used to get scolded on HN for preventing scrapers or bots. "How I access your site is irrelevant". |
|
| ▲ | ArcHound 7 hours ago | parent | prev | next [-] |
Neat! Most of the offensive scrapers I've met try to exploit WordPress sites (hence the focus on PHP). They don't want to see the PHP files themselves, but their output. What you have here is quite close to a honeypot; sadly, I don't see an easy way to counter-abuse such bots. If the attack doesn't follow their script, they move on. |
| |
| ▲ | jojobas 3 hours ago | parent [-] | | Yeah, I bet they run a regex on the output, and if there's no admin logon thingie where they can run exploits or stuff credentials, they'll just skip it. As for the battle of efficiency, generating 4 kB of bullshit PHP output is harder than running a regex. |
|
|
| ▲ | jcynix 6 hours ago | parent | prev | next [-] |
If you control your own Apache server and just want to shortcut to "go away" instead of feeding scrapers, the RewriteEngine is your friend, for example:
RewriteEngine On
# Block requests that reference .php anywhere (path, query, or encoded)
RewriteCond %{REQUEST_URI} (\.php|%2ephp|%2e%70%68%70) [NC,OR]
RewriteCond %{QUERY_STRING} \.php [NC,OR]
RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F,L]
Notes: there's no PHP on my servers, so if someone asks for it, they are one of the "bad boys" IMHO. Your mileage may differ. |
| |
| ▲ | palsecam 2 hours ago | parent [-] | | I do something quite similar with nginx:
# Nothing to hack around here, I’m just a teapot:
location ~* \.(?:php|aspx?|jsp|dll|sql|bak)$ {
return 418;
}
error_page 418 /418.html;
No hard block; instead, reply to bots with the funny HTTP 418 code (https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...). That makes filtering logs easier. Live example: https://FreeSolitaire.win/wp-login.php (NB: /wp-login.php is the WordPress login URL, and it’s commonly blindly requested by bots searching for weak WordPress installs.) | | |
| ▲ | jcynix an hour ago | parent | next [-] | | 418? Nice I'll think about it ;-)
In addition, I would prefer it if "402 Payment Required" were actually put to use for scrapers ... https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/... | |
| ▲ | kijin an hour ago | parent | prev [-] | | nginx also has "return 444", a special code that makes it drop the connection altogether. This is quite useful if you don't even want to waste any bandwidth serving an error page. You have an image on your error page, which some crappy bots will download over and over again. |
|
|
|
| ▲ | iam-TJ 5 hours ago | parent | prev | next [-] |
| This reminds me of a recent discussion about using a tarpit for A.I. and other scrapers. I've kept a tab alive with a reference to a neat tool and approach called Nepenthes that VERY SLOWLY drip feeds endless generated data into the connection. I've not had an opportunity to experiment with it as yet: https://zadzmo.org/code/nepenthes/ |
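A hypothetical sketch of the slow-drip idea in Python (not how Nepenthes itself is implemented); the endpoint, payload, and delay are made up:
import time
from wsgiref.simple_server import make_server

def tarpit_app(environ, start_response):
    """WSGI app that trickles out an endless stream of bytes, a little at a time."""
    start_response("200 OK", [("Content-Type", "text/html")])
    def drip():
        while True:
            yield b"<p>lorem ipsum</p>\n"  # any generated garbage works here
            time.sleep(2)                  # the slow drip is the whole point
    return drip()

if __name__ == "__main__":
    make_server("127.0.0.1", 8080, tarpit_app).serve_forever()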
|
| ▲ | vachina 2 hours ago | parent | prev | next [-] |
They’re not scraping for PHP files; they’re probing for known vulns in popular frameworks and then using them as entry points for pwning. This is done very efficiently. If you return anything unexpected, they’ll just drop you and move on. |
|
| ▲ | aduwah 4 hours ago | parent | prev | next [-] |
I wonder if the abusive bots could somehow be made to mine some crypto to pay back the bills they cause. |
| |
| ▲ | boxedemp 4 hours ago | parent [-] | | You could try to get them to run JavaScript, but I'm sure many of them have countermeasures. |
|
|
| ▲ | s0meON3 6 hours ago | parent | prev | next [-] |
| What about using zip bombs? https://idiallo.com/blog/zipbomb-protection |
| |
| ▲ | lavela 6 hours ago | parent | next [-] | | "Gzip only provides a compression ratio of a little over 1000: If I want a file that expands to 100 GB, I’ve got to serve a 100 MB asset. Worse, when I tried it, the bots just shrugged it off, with some even coming back for more." https://maurycyz.com/misc/the_cost_of_trash/#:~:text=throw%2... | | |
| ▲ | LunaSea 4 hours ago | parent [-] | | You could try other compression methods supported by browsers, like Brotli. Otherwise you can also chain compression methods, e.g. "Content-Encoding: gzip, gzip". |
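A rough Python sketch of the chained-compression idea, assuming the client actually honours stacked encodings; the sizes are illustrative:
import gzip

PLAIN_SIZE = 50 * 1024 * 1024  # 50 MB of zeros before compression

once = gzip.compress(b"\0" * PLAIN_SIZE)  # first pass: roughly 1000:1
twice = gzip.compress(once)               # the deflate stream of a long zero run is itself
                                          # repetitive, so a second pass shrinks it further

print(len(once), len(twice), PLAIN_SIZE)
# The doubly-compressed body would be served with:
#   Content-Encoding: gzip, gzip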
| |
| ▲ | renegat0x0 5 hours ago | parent | prev [-] | | Even I, who doesn't know much, implemented a workaround. My web crawler has both a byte limit and a timeout on scraping, so zip bombs don't bother me much: https://github.com/rumca-js/crawler-buddy I think garbage blabber would be more effective. |
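On the crawler side, a minimal sketch of that kind of defence (the limits and the use of the requests library are illustrative assumptions, not what crawler-buddy actually does):
import requests

MAX_BYTES = 1_000_000  # stop reading after ~1 MB of decoded body
TIMEOUT = (5, 15)      # connect / read timeouts in seconds

def fetch_limited(url: str) -> bytes:
    body = b""
    with requests.get(url, stream=True, timeout=TIMEOUT) as resp:
        for chunk in resp.iter_content(chunk_size=16_384):
            body += chunk
            if len(body) > MAX_BYTES:  # the byte cap defuses zip bombs and endless garbage
                break
    return body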
|
|
| ▲ | Surac 4 hours ago | parent | prev | next [-] |
I have just cut out IP ranges so they cannot connect. I am blocking the USA, Asia and the Middle East to prevent most malicious accesses. |
| |
| ▲ | breppp 3 hours ago | parent [-] | | Blocking most of the world's population is one way of reducing malicious traffic | | |
| ▲ | gessha 3 hours ago | parent | next [-] | | If nobody can connect to your site, it’s perfectly secure. | |
| ▲ | warkdarrior 2 hours ago | parent | prev [-] | | Make sure to block your own IP address to minimize the chance of a social engineering attack. | | |
| ▲ | bot403 an hour ago | parent [-] | | Include 127.0.0.1 as well just in case they get into the server. |
|
|
|
|
| ▲ | localhostinger 6 hours ago | parent | prev | next [-] |
Interesting! It's nice to see people experimenting with these, and I wonder if this kind of junk-data generator will become its own product. Or maybe at least a feature/integration in existing software. I could see it going there. |
|
| ▲ | NoiseBert69 7 hours ago | parent | prev | next [-] |
Hm.. why not use small, dumbed-down, self-hosted LLMs to feed the big scrapers with bullshit? I'd sacrifice two CPU cores for this just to make their life awful. |
| |
| ▲ | Findecanor 4 hours ago | parent | next [-] | | You don't need an LLM for that. There is a link in the article to an approach using Markov chains created from real-world books, but then you'd let the scrapers' LLMs reinforce their training on those books and not on random garbage. I would make a list of words for each word class, and a list of sentence structures where each item is a word class.
Pick a pseudo-random sentence; for each word class in the sentence, pick a pseudo-random word; output; repeat. That should be pretty simple and fast. I'd think the most important thing though is to add delays to serving the requests. The purpose is to slow the scrapers down, not to induce demand on your garbage well. | |
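A minimal Python sketch of that word-class generator; the word lists and sentence templates are made-up placeholders:
import random

WORDS = {
    "ADJ":  ["purple", "recursive", "damp", "ancient"],
    "NOUN": ["teapot", "crawler", "ledger", "greenhouse"],
    "VERB": ["devours", "indexes", "polishes", "misplaces"],
}
TEMPLATES = [  # each item is a word class
    ["ADJ", "NOUN", "VERB", "ADJ", "NOUN"],
    ["NOUN", "VERB", "NOUN"],
]

def garbage_paragraph(seed: str, sentences: int = 5) -> str:
    rng = random.Random(seed)  # seed by request path so each page stays stable
    out = []
    for _ in range(sentences):
        template = rng.choice(TEMPLATES)  # pick a pseudo-random sentence structure
        words = [rng.choice(WORDS[cls]) for cls in template]
        out.append(" ".join(words).capitalize() + ".")
    return " ".join(out)

print(garbage_paragraph("/some/requested/path"))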
| ▲ | qezz 4 hours ago | parent | prev [-] | | That's very expensive. |
|
|
| ▲ | re-lre-l 5 hours ago | parent | prev [-] |
| Don’t get me wrong, but what’s the problem with scrapers? People invest in SEO to become more visible, yet at the same time they fight against “scraper bots.” I’ve always thought the whole point of publicly available information is to be visible. If you want to make money, just put it behind a paywall. Isn’t that the idea? |
| |
| ▲ | georgefrowny 5 hours ago | parent | next [-] | | There's a difference between putting information online for your customers or even people in general (e.g. as a hobby) and working in concert with scraping for greater visibility via search, versus giving that work away for free, or even at a cost to you, to companies who at best don't care and may well be competition, see themselves as replacing you, or are otherwise adversarial. The line is between "I am technically able to do this" and "I am engaging with a system in good faith". Public parks are just there, and I can technically drive up and dump rubbish in them; if they didn't want me to, they should have installed a gate and sold tickets. Many scrapers these days are, in that analogy, roughly equivalent to entire fleets of waste disposal vehicles all driving to the parks to unload, putting strain on park operations and making the parks a less tenable service in general. |
| ▲ | nrhrjrjrjtntbt 5 hours ago | parent | prev | next [-] | | The old scrapers indexed your site so you may get traffic. This benefits you. AI scrapers will plagiarise your work and bring you zero traffic. | | |
| ▲ | ProofHouse 4 hours ago | parent [-] | | Ya make sure you hold dear that grain of sand on a beach of pre-training data that is used to slightly adjust some embedding weights | | |
| ▲ | jcynix 3 hours ago | parent | next [-] | | Sand is the world's second most used natural resource, and sand usable for concrete even gets illegally removed all over the world nowadays. So to continue your analogy: I made my part of the beach accessible for visitors to enjoy, but certain people think they can carry it away for their own purposes ... |
| ▲ | throwawa14223 an hour ago | parent | prev | next [-] | | I have no reason to help the richest companies on earth adjust weights at a cost to myself. | |
| ▲ | boxedemp 3 hours ago | parent | prev | next [-] | | One Reddit post can get an LLM to recommend putting glue in your pizza. But the takeaway here is to cheese the bots. | |
| ▲ | exe34 3 hours ago | parent | prev [-] | | that grain of sand used to bring traffic, now it doesn't. it's pretty much an economic catastrophe for those who relied on it. and it's not free to provide the data to those who will replace you - they abuse your servers while doing it. |
|
| |
| ▲ | saltysalt 4 hours ago | parent | prev | next [-] | | You are correct, and the hard reality is that content producers don't get to pick and choose who gets to index their public content because the bad bots don't play by the rules of robots.txt or user-agent strings. In my experience, bad bots do everything they can to identify as regular users: fake IPs, fake agent strings...so it's hard to sort them from regular traffic. | |
| ▲ | Dilettante_ 4 hours ago | parent | prev [-] | | Did you read TFA? These scrapers drown people's servers in requests, taking up literally all the resources and driving up costs. |
|