hartator | 2 days ago
There are already “infinite” websites like these on the Internet. Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain. Unknown websites will get very few crawls per day, whereas popular sites get millions. Source: I am the CEO of SerpApi.
dawnerd | 2 days ago
Looking at the logs for all of my sites, this isn’t a global truth. I see multiple AI crawlers hammering away, requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.
palmfacehn | 2 days ago
Even a brand new site will get hit heavily by crawlers: Amazonbot, Applebot, LLM bots, scrapers abusing FB's link preview bot, SEO metric bots, and more than a few crawlers out of China. The desirable, well-behaved crawlers are the only ones who might lose interest. The typical entry point is a sitemap or RSS feed.

Overall I think the author is misguided in using the tarpit approach. Slow sites get fewer crawls. I would suggest serving easily GZIP'd content and deeply nested tags instead (a rough sketch of the idea follows below). There are also tricks with XSL, but I doubt many mature crawlers will fall for that one.
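As a rough illustration of the GZIP'd-content idea, here is a minimal Flask sketch. The framework, route name, and payload size are my own assumptions, not anything the comment above specifies: the server stores and sends a small pre-compressed blob that inflates to many megabytes of deeply nested markup on the crawler's side.

    import gzip

    from flask import Flask, Response, request

    app = Flask(__name__)

    # Highly repetitive, deeply nested markup compresses extremely well:
    # roughly 13 MB of nested <div> tags shrinks to a few tens of
    # kilobytes of gzip, so serving it costs the server almost nothing.
    PAYLOAD = gzip.compress(b"<div>" * 1_000_000 + b"x" + b"</div>" * 1_000_000)

    @app.route("/articles/<slug>")
    def article(slug):
        # Only clients that advertise gzip support get the compressed blob.
        if "gzip" not in request.headers.get("Accept-Encoding", ""):
            return "Nothing to see here."
        # Send the pre-compressed bytes as-is; the crawler inflates them.
        return Response(
            PAYLOAD,
            mimetype="text/html",
            headers={"Content-Encoding": "gzip"},
        )

    if __name__ == "__main__":
        app.run()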
pilif | 2 days ago
> Unknown websites will get very few crawls per day whereas popular sites millions.

We're hosting some pretty unknown, very domain-specific sites and are getting hammered by Claude and others which, compared to old-school search engine bots, also get caught up in the weeds and request the same pages over and over.

They also don't seem to care about the response time of the pages they are fetching: when they get caught in the weeds and hit some very badly performing edge cases, they don't throttle at all and keep requesting at 30+ requests per second even when a page takes more than a second to be returned.

We can of course handle this and make them go away, but in the end this behavior will only hurt them, both because they will face more and more opposition from web masters and because they are wasting their own resources.

For decades, our solution for search engine bots was basically an empty robots.txt and letting the bots deal with our sites. Bots behaved reasonably and intelligently enough that this was a working strategy. Now, in light of the current AI bots, which from an outside observer's viewpoint look like they were cobbled together with the least effort possible, this strategy is no longer viable, and we would have to resort to providing a meticulously crafted robots.txt to individually help each hacked-up AI bot not get lost in the weeds.

Or, you know, we just blanket ban them (sketched below).
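For what it's worth, the blanket ban doesn't have to be complicated. A minimal sketch, assuming a Python/WSGI stack and a hand-picked (necessarily incomplete) list of AI-crawler user agents; this is my own illustration, not pilif's actual setup, and a robots.txt or reverse-proxy rule would work just as well:

    # WSGI middleware that returns 403 for known AI-crawler user agents.
    BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

    class BlockAIBots:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return self.app(environ, start_response)

    # Usage with any WSGI app, e.g. Flask:
    #   app.wsgi_app = BlockAIBots(app.wsgi_app)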
marginalia_nu | 2 days ago
Yeah, I agree with this. These types of roach motels have been around for decades, are well understood at this point, and aren't much of a problem for anyone. You basically need to be able to deal with them to do any sort of large-scale crawling.

The reality of web crawling is that the web is already extremely adversarial, and any crawler will have every imaginable kind of nonsense thrown at it, ranging from TCP tar pits to compression and XML bombs; really, there's no end to what people will put online.

A more resource-effective way to block misbehaving crawlers is to put a hidden link on each page pointing to some path forbidden via robots.txt, randomly generated perhaps so the links are always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period (a rough sketch follows below).
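A minimal sketch of that honeypot idea, assuming Flask and an in-memory block list; the route names, token scheme, and one-hour window are my own choices. It answers with a 403 rather than literally dropping the connection, and a production setup would block at the firewall or reverse proxy rather than in the application:

    import secrets
    import time

    from flask import Flask, abort, request

    app = Flask(__name__)

    BLOCKED = {}          # ip -> unix timestamp when the block expires
    BLOCK_SECONDS = 3600  # arbitrary one-hour block window

    @app.before_request
    def reject_blocked_ips():
        # Refuse every request from an IP that recently hit the trap.
        if BLOCKED.get(request.remote_addr, 0) > time.time():
            abort(403)

    @app.route("/robots.txt")
    def robots():
        # The trap prefix is disallowed, so well-behaved crawlers never follow it.
        return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

    @app.route("/page")
    def page():
        # Hidden, randomly generated link that only a robots.txt-ignoring
        # crawler (or a very curious human) would ever fetch.
        token = secrets.token_hex(8)
        return (
            "<html><body>Real content."
            f'<a href="/trap/{token}" style="display:none"></a>'
            "</body></html>"
        )

    @app.route("/trap/<token>")
    def trap(token):
        # Anything that fetches a disallowed URL gets its IP blocked for a while.
        BLOCKED[request.remote_addr] = time.time() + BLOCK_SECONDS
        abort(403)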
angoragoats | a day ago
This may be true for the large, established crawlers run by Google, Bing, et al. I don’t see how you can make it a blanket statement for all crawlers, and my own personal experience tells me it isn’t correct.
diggan | 2 days ago
> There are already “infinite” websites like these on the Internet.

Cool. And how much of the software driving these websites is FOSS that I can download and run for my own (popular enough to be crawled more than daily by multiple scrapers) website?
qwe----3 | 2 days ago
This certainly violates the TOS for using Google.
p0nce | a day ago
A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.
SrslyJosh | 9 hours ago
> Source: I am the CEO of SerpApi.

Credibility: zero.