chupchap 6 hours ago

Bot traffic is crazy even for smaller sites, but it's still manageable. I was getting 2,000 visitors a day on my infrequently updated website, but after I blocked all the bots via Cloudflare it dropped back to its normal double-digit visitor count.
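
Cloudflare's bot blocking is a dashboard setting rather than code, but the same first-pass idea can be sketched at the application layer. A minimal WSGI middleware, with an illustrative (not exhaustive) list of AI crawler user-agent substrings; note that the worst scrapers spoof their User-Agent, which is part of why edge-level blocking tends to work better:

    # Hypothetical WSGI middleware: reject requests whose User-Agent matches
    # a known crawler name. The substring list is illustrative only.
    BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

    class BotBlockMiddleware:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return self.app(environ, start_response)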

drcongo an hour ago | parent | next [-]

One day last week one of my clients' sites was getting about 2k "visitors" per second - I had to block the entire AS45102 to make it stop.
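
For reference, blocking a whole AS means blocking every prefix it announces. A sketch in Python, assuming the announced prefixes have been exported (e.g. from a BGP data source such as RIPEstat) into a prefixes.txt file, one CIDR per line; in practice you would push the list into a firewall or CDN rule rather than check it per request:

    import ipaddress

    # Load the AS's announced prefixes; the file name is an assumption.
    def load_blocked_networks(path="prefixes.txt"):
        with open(path) as f:
            return [ipaddress.ip_network(line.strip()) for line in f if line.strip()]

    BLOCKED = load_blocked_networks()

    def is_blocked(client_ip: str) -> bool:
        # Linear scan for clarity; a radix tree is the right structure at scale.
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED)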

spiderfarmer 5 hours ago | parent | prev [-]

I have 6M pages across 8 domains. Bots on residential proxies hit them from about 10 unique IPs per second, working hard to scrape every single page.

dylan604 4 hours ago | parent [-]

and how often are those 6M pages changing? how often are those bots finding anything new? why don't the bot makers notice that nothing has changed and slow their requests down for content that, to them, is essentially stale?
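
HTTP already gives crawlers what they would need for this: conditional requests. A sketch of a polite recrawl, assuming the requests library and an in-memory validator store (a real crawler would persist it), where an unchanged page costs a cheap 304 instead of a full transfer:

    import requests

    validators = {}  # url -> (etag, last_modified)

    def fetch_if_changed(url):
        headers = {}
        etag, last_mod = validators.get(url, (None, None))
        if etag:
            headers["If-None-Match"] = etag
        if last_mod:
            headers["If-Modified-Since"] = last_mod
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return None  # unchanged since the last crawl, nothing to re-ingest
        validators[url] = (resp.headers.get("ETag"), resp.headers.get("Last-Modified"))
        return resp.text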

mrweasel 4 hours ago | parent | next [-]

In March 2025, Drew DeVault wrote a blog post called "Please stop externalizing your costs directly into my face"[1]. I think it's a pretty good guess as to why these bots don't care about the frequency of changes: tracking that costs too much.

Every run is basically a fresh run: no state stored, every page just fed into the machine anew. At least that's my theory.

The AI companies need a full copy of your page every time they retrain a model. They could store that in their own datacenters, but that's a full copy of the internet, in a market where storage costs are already pretty high. So instead, they externalize the storage cost. If you run a website, a public GitLab instance, a Forgejo instance, a wiki, a forum, whatever, you basically function as free offsite storage for the AI companies.

1) https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
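
The state needed to skip unchanged pages is tiny compared to storing the pages themselves: a hash per URL. A standard-library sketch of the bookkeeping this theory says the scrapers skip (the store here is in-memory and hypothetical):

    import hashlib

    seen_hashes = {}  # url -> sha256 hex digest of the last-seen body

    def changed_since_last_crawl(url, body):
        digest = hashlib.sha256(body).hexdigest()
        if seen_hashes.get(url) == digest:
            return False  # identical content, no need to re-ingest
        seen_hashes[url] = digest
        return True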

lovehashbrowns 3 hours ago | parent | prev [-]

On the platform at my work they scrape the same page multiple times, over and over; they don't bother to cache anything. It's ridiculous to plan for, because our properties are all news-based, so warming the cache used to be as simple as loading the first X articles. With AI scrapers that's no longer viable, because they pull as much as possible, including articles from 2017 and 2018. Management doesn't want to block them, though, so it's just suffering through the endless barrage. I was able to do a lot here, like heavier caching even with pgpool, but it's crazy that this small subset of bots accounts for something like 60%+ of our spend.
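
The cache-warming described above can be sketched as a small script that primes the cache with the newest N articles; the sitemap URL and the count are hypothetical placeholders:

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://news.example.com/sitemap.xml"  # hypothetical
    N = 50  # stand-in for the "first X articles" above

    def warm_cache():
        sitemap = requests.get(SITEMAP_URL, timeout=30).text
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        urls = [loc.text for loc in ET.fromstring(sitemap).findall(".//sm:loc", ns)]
        for url in urls[:N]:
            requests.get(url, timeout=30)  # one hit is enough to populate the cache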

spiderfarmer 3 hours ago | parent [-]

Many are using residential proxies now, and it's practically impossible to block them; not even Google Analytics manages to filter them out. People stare at reports thinking their website is suddenly very popular, but it's all random IPs from random locations across the world, each requesting one page at a time, at random times of the day.
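
The pattern is at least detectable in access logs, even if it can't be blocked: when a large share of client IPs request exactly one page and never return, the "traffic" is a distributed scraper, not popularity. A sketch, assuming a log format with the client IP as the first space-separated field:

    from collections import Counter

    def one_hit_ip_share(log_path):
        hits = Counter()
        with open(log_path) as f:
            for line in f:
                if not line.strip():
                    continue
                hits[line.split(" ", 1)[0]] += 1
        one_hit = sum(1 for n in hits.values() if n == 1)
        return one_hit / len(hits) if hits else 0.0

    # e.g. a share above ~0.9 across thousands of IPs suggests a distributed scraper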