jimrandomh 2 hours ago
I deal with scrapers that sometimes border on DDoSes for LessWrong. The amount of bot traffic varies greatly between sites; the more URLs you have, the more bot traffic you get (regardless of whether those URLs represent a deep content catalog or useless URL-parameter permutations). It's bad for LW because of the depth of its content catalog.

It's easy to drastically underestimate the amount of bot traffic, because bots make efforts (of varying sophistication) to look human enough to evade blocking. That includes using fake user-agent strings corresponding to real browsers (often, but not always, with implausibly old version numbers), proxying through residential IPs, and sometimes using full headless browsers. In my own data, traffic from badly behaved browser-impersonation bots exceeds traffic from named scrapers like GPTBot by something like 10x.

The measured percentage of bot traffic is higher for HTML than for other content types, because many bots will load an HTML page and then not load the JS, CSS, image, and other resources it references. But these are the least sophisticated and most detectable bots.
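The two cheapest signals mentioned above (HTML fetched without any of its sub-resources, and a browser user-agent with an implausibly old version) can be sketched roughly as below. This is a minimal illustration, not anything LessWrong is known to run: the function names, the `(client_id, path, user_agent)` record shape, the asset suffix list, and the Chrome version cutoff of 100 are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumption: any Chrome major version below this counts as "implausibly old".
MIN_PLAUSIBLE_CHROME_MAJOR = 100
# Assumption: paths ending in these suffixes are page sub-resources.
ASSET_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".woff2", ".ico")
CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def has_stale_chrome_version(user_agent, min_major=MIN_PLAUSIBLE_CHROME_MAJOR):
    """True if the UA claims to be Chrome with a major version below min_major."""
    match = CHROME_RE.search(user_agent)
    return bool(match) and int(match.group(1)) < min_major


def flag_suspect_clients(requests, min_pages=3):
    """requests: iterable of (client_id, path, user_agent) tuples.

    Returns {client_id: [reasons]} for clients that trip either heuristic.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    agents = {}
    for client_id, path, user_agent in requests:
        # Ignore the query string when deciding whether a path is an asset.
        if path.split("?", 1)[0].lower().endswith(ASSET_SUFFIXES):
            assets[client_id] += 1
        else:
            pages[client_id] += 1
        agents[client_id] = user_agent  # keep the last UA seen for this client

    flagged = {}
    for client_id, user_agent in agents.items():
        reasons = []
        # Several pages loaded, but never a single sub-resource.
        if pages[client_id] >= min_pages and assets[client_id] == 0:
            reasons.append("html_without_assets")
        if has_stale_chrome_version(user_agent):
            reasons.append("stale_browser_version")
        if reasons:
            flagged[client_id] = reasons
    return flagged
```

As the comment notes, this only catches the least sophisticated bots; headless browsers load assets and send current version strings, and a real visitor with a warm cache or an asset CDN on another host could trip the first check.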
kev009 an hour ago
Meta comes through with a /24 worth of scrapers and ignores robots.txt. I'm inclined to poison my data with fake information about Zuckerberg.
reconnecting an hour ago
As for the residential IPs you mentioned: these can only be afforded by scrapers that were built specifically for your website and have a financial incentive. I don't believe anyone would spend money on residential IPs just to crawl the entire internet.

Browser/IP impersonation bots come from datacenter networks, and there are a dozen or so ASNs where they typically live.

General crawlers (SEO, search engines, Meta, Alibaba, etc.) usually follow robots.txt.

The result: the real pain is only the first category, where data from your website has some financial value. But this isn't an infinite number of bots; depending on the business, it's a countable number.
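If impersonation bots really do cluster in a handful of datacenter ASNs, one way to act on that is to match client IPs against the prefixes those ASNs announce. The sketch below uses only Python's standard `ipaddress` module. The prefix list is a placeholder built from the RFC 5737 documentation ranges, and `is_datacenter_ip` is a hypothetical helper name; a real list would have to come from whatever ASN-to-prefix data you trust.

```python
import ipaddress

# Placeholder prefixes (RFC 5737 documentation ranges), standing in for the
# prefixes announced by the datacenter ASNs you want to treat as suspect.
DATACENTER_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter_ip(ip, prefixes=DATACENTER_PREFIXES):
    """True if ip falls inside any of the listed prefixes."""
    addr = ipaddress.ip_address(ip)
    # An IPv6 address is never "in" an IPv4 network, so mixed lists are safe.
    return any(addr in net for net in prefixes)
```

For example, `is_datacenter_ip("192.0.2.17")` is true with the placeholder list, while an address outside both prefixes is not. A match is a reason to rate-limit or challenge rather than proof of a bot, since VPN and corporate traffic also exits from datacenter ranges.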
arjie an hour ago
Does LW have a downloadable archive? I can only find references to GreaterWrong but no public answer. Would be useful.
sometimelurker 26 minutes ago
Thank you for maintaining LessWrong.