jimrandomh 2 hours ago

I deal with scrapers that sometimes border on DDoSes for LessWrong. The amount of bot traffic varies greatly between sites; if you have more URLs you get more bot traffic (regardless of whether those URLs represent a deep content catalog, or useless URL parameter permutations). It's bad for LW because of the content-catalog depth.

It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often but not always with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.
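For the "implausibly old version numbers" case, one cheap signal is to pull the claimed Chrome major version out of the user-agent string and compare it against a cutoff. This is only a rough sketch: the cutoff `MIN_PLAUSIBLE_CHROME` and the function name `looks_stale` are made up for illustration, and real traffic would need per-browser handling and a cutoff that moves over time.

```python
import re

# Hypothetical cutoff: treat Chrome major versions below this as suspiciously old.
MIN_PLAUSIBLE_CHROME = 110

CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def looks_stale(user_agent: str, min_major: int = MIN_PLAUSIBLE_CHROME) -> bool:
    """Return True if the UA claims a Chrome major version below min_major."""
    match = CHROME_RE.search(user_agent)
    if match is None:
        return False  # not Chrome-like; this check has no opinion
    return int(match.group(1)) < min_major
```

On its own this is a weak signal, since some real users do run old browsers; it is more useful combined with other evidence than as a blocking rule.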

The measured percentage of bot traffic is higher for HTML than for other content types because many bots will load an HTML page, and then not load the JS/CSS/image/etc resources it references. But these are the least-sophisticated and most-detectable bots.
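The HTML-without-subresources pattern can be approximated from request logs. Below is a minimal sketch, assuming you already have (client, path) pairs and that assets can be recognized by file suffix; `ASSET_SUFFIXES`, `html_only_clients`, and the `min_pages` threshold are all illustrative names and values, not anything from LW's actual setup.

```python
from collections import defaultdict

# Assumed: any path ending in one of these suffixes counts as a page asset.
ASSET_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff2", ".ico")

def html_only_clients(requests, min_pages=3):
    """requests: iterable of (client_id, path) pairs.

    Returns the set of clients that fetched at least min_pages
    non-asset paths and never fetched a single asset.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for client, path in requests:
        bare = path.split("?", 1)[0].lower()  # ignore the query string
        if bare.endswith(ASSET_SUFFIXES):
            assets[client] += 1
        else:
            pages[client] += 1
    return {c for c, n in pages.items() if n >= min_pages and assets[c] == 0}
```

Real browsers with warm caches can also skip asset requests, so a heuristic like this probably needs a time window and a higher threshold before it is trusted.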

kev009 an hour ago | parent | next [-]

Meta comes through with a /24 worth of scrapers and ignores robots.txt. I'm inclined to poison my data with fake information about Zuckerberg.

reconnecting an hour ago | parent [-]

Did you check the IP addresses? Are they all from AS32934?

kev009 an hour ago | parent [-]

Yes

57.141.0.42 - - [05/Jun/2026:19:50:19 +0000] "GET /mid/a017bc62-0982-42db-8403-241d69da8d0f@alexander-goetzenstein.my-fqdn.de HTTP/2.0" 303 0 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.48 - - [05/Jun/2026:19:50:22 +0000] "GET /group/comp.os.linux.advocacy/a/a236f5a5-63a4-4982-8bb6-07ffc684201b@googlegroups.com HTTP/2.0" 200 34838 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.55 - - [05/Jun/2026:19:50:23 +0000] "GET /group/alt.recovery.aa/a/ne6onq%24hpp%241@dont-email.me HTTP/2.0" 200 5606 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.56 - - [05/Jun/2026:19:50:24 +0000] "GET /group/aioe.news.assistenza/a/qpukie%241i1g%241@neodome.net?view=headers HTTP/2.0" 200 17027 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.36 - - [05/Jun/2026:19:50:29 +0000] "GET /group/alt.obituaries/a/uf8pej%241hqi1%241@news.xmission.com HTTP/2.0" 200 6123 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"

57.141.0.66 - - [05/Jun/2026:19:50:29 +0000] "GET /group/comp.theory/a/v3640k%24vg63%243@dont-email.me HTTP/2.0" 200 148720 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/craw...)"
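To see how many distinct addresses in one block share a user agent, lines in this format can be grouped by subnet. A rough sketch, assuming the combined log format shown above and IPv4 clients; `LOG_RE` and `ips_per_subnet` are illustrative names. It does not resolve ASNs, which would need an external data source.

```python
import ipaddress
import re
from collections import defaultdict

# Combined log format:
# ip identity user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def ips_per_subnet(lines, prefix_len=24):
    """Map (network, user agent) to the set of distinct client IPs seen."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        try:
            net = ipaddress.ip_network(f"{m['ip']}/{prefix_len}", strict=False)
        except ValueError:
            continue  # first field was not a usable IP address
        seen[(str(net), m["ua"])].add(m["ip"])
    return seen
```

Sorting the result by set size would surface blocks where many neighbouring addresses present the same agent string.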

reconnecting an hour ago | parent [-]

And I assume you have this in your robots.txt?

  User-agent: meta-externalagent
  Disallow: /
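Whether a rule set like this covers a given user-agent string can be checked offline with Python's standard `urllib.robotparser`. A small sketch; the helper name `allowed` and the `example.org` URLs are placeholders. It only shows what a compliant crawler should do, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The rules quoted in this comment.
RULES = """\
User-agent: meta-externalagent
Disallow: /
"""
```

With these rules, `allowed(RULES, "meta-externalagent/1.1", "https://example.org/group/comp.theory/")` should come back `False`, while an agent with no matching entry is permitted.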

Symbiote 43 minutes ago | parent | next [-]

I have observed the same from Meta's crawler. Our preproduction site, for example, serves

  User-agent: *
  Disallow: /

and Meta is the only big-tech crawler that accesses it, at least with an honest user agent. (Meta also accesses disallowed paths on the production site.)

kev009 an hour ago | parent | prev [-]

They don't obey *, so they don't get their own entry. I'd rather just poison their data; it's well-known behavior from them.

https://www.reddit.com/r/webdev/comments/1sdzd1q/metas_ai_cr...

reconnecting an hour ago | parent | prev | next [-]

As for the residential IPs you mentioned: these can only be afforded by scrapers that were built specifically for your website and have a financial incentive. I don't believe anyone would spend money on residential IPs just to crawl the entire internet.

Browser/IP impersonation bots come from datacenter networks, and there are a dozen or so ASNs where they typically live.
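If you have exported the prefixes announced by those ASNs, tagging a client as datacenter traffic is a simple membership test. A minimal sketch: the prefixes below are RFC 5737 documentation ranges used purely as placeholders, and `DATACENTER_PREFIXES` and `is_datacenter` are illustrative names, not a real blocklist.

```python
import ipaddress

# Placeholder prefixes (RFC 5737 documentation ranges), standing in for
# prefixes exported from whatever ASN data source you use.
DATACENTER_PREFIXES = [
    ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")
]

def is_datacenter(ip: str, prefixes=DATACENTER_PREFIXES) -> bool:
    """Return True if ip falls inside any of the given prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in prefixes)
```

A linear scan is fine for a few hundred prefixes; a larger list would probably want a radix tree or a sorted-interval lookup.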

General crawlers (SEO tools, search engines, Meta, Alibaba, etc.) usually follow robots.txt.

The result: the real pain is only the first category, where data from your website has some financial value. But this isn't an infinite number of bots; depending on the business, there's a countable number of them.

arjie an hour ago | parent | prev | next [-]

Does LW have a downloadable archive? I can only find references to GreaterWrong but no public answer. Would be useful.

sometimelurker 26 minutes ago | parent | prev [-]

thank you for maintaining LessWrong