kstrauser 5 days ago

Tencent scrapers are hitting my little Forgejo site 4 times a second, 24/7. I pay for that bandwidth. Platitudes sound great, but this isn’t a lofty “drinking from the public well”. This is bastard operators taking a drink and pooping in it.

My thoughts will have more room for nuance when they stop abusing the hell out of my resources they’re “borrowing”.

psychoslave 5 days ago | parent | next [-]

Why are they even doing this? This doesn’t feel like something that can even bring any value downstream to their own selfish pipelines, or am I missing something?

kstrauser 5 days ago | parent | next [-]

No! They’re constantly hitting the same stupid URL (“show me this file in this commit in this repo with these 47 query params”) from a few thousand IPs in China and Brazil, with user agents showing an iPod or a Linux desktop running Opera 3.

I wrote a little script where I throw in an IP and it generates a Caddy IP-matcher block with an “abort” rule for every netblock in that IP’s ASN. I’m sure there are more elegant ways to share my work with the world while blocking the scoundrels, but this is kind of satisfying for the moment.
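For anyone wanting to do the same: the generator part can be sketched in a few lines of Python. This is a hedged sketch, not the commenter's actual script; it assumes you've already resolved the offending IP's ASN to its announced prefixes separately (e.g. with a whois or BGP lookup), and the matcher name and prefixes below are made-up examples.

```python
def caddy_abort_block(name: str, cidrs: list[str]) -> str:
    """Render a Caddyfile snippet that aborts requests from the given netblocks.

    `cidrs` is assumed to be the list of prefixes announced by the
    offending IP's ASN, fetched beforehand by some external lookup.
    """
    prefixes = " ".join(cidrs)
    return (
        f"@{name} {{\n"          # named request matcher
        f"\tremote_ip {prefixes}\n"  # match client IPs against the netblocks
        f"}}\n"
        f"abort @{name}\n"       # drop matching connections without a response
    )

# Example usage with hypothetical prefixes:
print(caddy_abort_block("scrapers", ["43.128.0.0/10", "101.32.0.0/16"]))
```

Dropping the resulting block into the Caddyfile makes Caddy close the connection outright for those ranges, so you don't even pay for a response body.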

danaris 5 days ago | parent | prev [-]

Best I can figure, they've decided it's easier to set up their scrapers to simply scrape absolutely everything, all the time, forever, than to more carefully select what's worth getting.

Various LLM-training scrapers were absolutely crippling my tiny (~125 weekly unique users) browser game until I put its Wiki behind a login wall. There is no possible way they could see any meaningful return from doing so.

HankStallone 5 days ago | parent [-]

I get the impression that they're just too lazy or incompetent, or in too big a hurry, to put some sensible logic in their scrapers. Maybe they have an LLM write the scraper and don't bother to ask for anything more than "Make a web scraper that gets all the files it can as fast as it can."

The last one I blocked was hitting my site 24 times/second, and a lot of them were the same CSS file over and over.

protocolture 2 days ago | parent | prev [-]

Being a dick while scraping isn't really the same question as the use of that data.

Anyway, the answer is: block 'em.