dylan604 4 hours ago
And how often are those 6M pages changing? How often are those bots finding anything new? Why aren't the bot makers noticing that nothing has changed and slowing their requests down, since the content is essentially stale to them?
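For what it's worth, HTTP already gives a crawler a cheap way to notice stale content: a conditional GET with If-None-Match / If-Modified-Since returns 304 and no body when a page hasn't changed. A minimal sketch of that idea (not any particular bot's actual behavior, just an illustration):

    import requests

    def fetch_if_changed(url, last_etag=None, last_modified=None):
        """Conditional GET: return new content, or None if the page is unchanged."""
        headers = {}
        if last_etag:
            headers["If-None-Match"] = last_etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified

        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:  # Not Modified: server sent no body at all
            return None, last_etag, last_modified
        return (resp.text,
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"))

All it takes is remembering one ETag or timestamp per URL, which makes the refusal to do it look like a choice rather than a technical limit.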
mrweasel 3 hours ago
In March 2025, Drew DeVault wrote a blog post called "Please stop externalizing your costs directly into my face"[1]. I think that is a pretty good guess as to why these bots don't care about the frequency of changes: keeping track would cost too much. Every run is basically a fresh run, no state stored, every page just fed into the machine anew. At least that's my theory.

The AI companies need a full copy of your pages every time they retrain a model. Now they could store that in their own datacenters, but that's a full copy of the internet, in a market where storage costs are already pretty high. So instead they externalize the storage cost. If you run a website, a public GitLab instance, Forgejo, a wiki, a forum, whatever, you basically function as free offsite storage for the AI companies.

[1] https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
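To put "no state stored" in perspective, the bookkeeping needed to skip unchanged pages is tiny compared to refetching and reprocessing everything: a content hash per URL in a small local database. A rough sketch, with the file and table names made up for illustration:

    import hashlib
    import sqlite3

    db = sqlite3.connect("seen_pages.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, sha256 TEXT)")

    def is_new_content(url, body):
        """Return True only if this URL's content differs from the last crawl."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        row = db.execute("SELECT sha256 FROM pages WHERE url = ?", (url,)).fetchone()
        if row and row[0] == digest:
            return False  # unchanged since last time, no need to feed it in again
        db.execute("INSERT OR REPLACE INTO pages (url, sha256) VALUES (?, ?)",
                   (url, digest))
        db.commit()
        return True

That doesn't save the fetch itself, but it would save re-feeding stale pages into the pipeline, which is exactly the state these crawlers apparently decline to keep.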
lovehashbrowns 3 hours ago
On the platform at my work they scrape the same pages multiple times, over and over; they don't bother to cache anything. It's also hard to plan for: our properties are all news-based, so warming the cache used to be as simple as loading the first X articles to get them into cache. With the AI bots that no longer works, because they scrape as much as possible, including articles from 2018 and 2017. Management doesn't want to block them though, so it's just suffering through the endless barrage. I was able to mitigate a lot of it with heavier caching, even with pgpool, but it's crazy that this small subset of bots effectively accounts for something like 60%+ of our spend.
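For readers unfamiliar with the cache-warming pattern being described: for a news site it can be as small as fetching the N newest article URLs once, so the page cache is hot before readers (or bots) arrive. A hypothetical sketch, assuming some feed of recent articles exists (the endpoint and field names here are invented):

    import requests

    RECENT_FEED = "https://example-news-site.com/api/recent-articles"  # hypothetical endpoint
    WARM_COUNT = 50  # the "first X articles"

    def warm_cache():
        """Fetch the newest articles once so the page cache is populated before traffic hits."""
        articles = requests.get(RECENT_FEED, timeout=10).json()
        for article in articles[:WARM_COUNT]:
            # A plain GET through the front door fills the same cache real readers use.
            requests.get(article["url"], timeout=10)

    if __name__ == "__main__":
        warm_cache()

The problem described above is that bots crawling the deep archive never hit that warmed set, so every request falls through to the database.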