jimmaswell 7 days ago

What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?

themafia 6 days ago | parent | next [-]

If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.

Search engines, at least, are designed to index the content, for the purpose of helping humans find it.

Language models are designed to filch content out of my website so they can reproduce it later without telling the humans where it came from or linking them to my site to find the source.

This is exactly why "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."

jimmaswell 6 days ago | parent | next [-]

> copyright attribution

You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence: for example, if you subtract the vector for "man" from "woman" and add the difference to "king", you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al., and saying they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
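
A minimal sketch of that vector-arithmetic demonstration, using gensim and its pretrained GloVe vectors; the library and model name are my choice here, not something established in this thread:

  # The classic word-embedding analogy: king - man + woman lands near queen.
  # Assumes gensim is installed and can download the pretrained GloVe model.
  import gensim.downloader as api

  vectors = api.load("glove-wiki-gigaword-100")  # small pretrained word embedding

  # most_similar() sums the "positive" vectors, subtracts the "negative" ones,
  # and returns the nearest words to the resulting point.
  print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
  # typically prints something like [('queen', 0.77)]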

heavyset_go 6 days ago | parent | next [-]

LLMs quite literally work at the level of their source material; that's how training works, that's how RAG works, etc.

There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.

It is a bit ironic that you'd call someone who wants to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia for why it's okay for a trillion-dollar private company to steal someone else's work for its own profit.

It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.

6 days ago | parent | prev [-]
[deleted]
5 days ago | parent | prev | next [-]
[deleted]
6 days ago | parent | prev [-]
[deleted]
marvinborner 6 days ago | parent | prev | next [-]

As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun faster and louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis, the traffic went back down to where it was before the AI hype started.

[1] https://types.pl/@marvin/114394404090478296
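
For anyone wanting the same per-bot numbers from their own server, they can be pulled straight out of an access log; a rough sketch assuming a combined-format nginx/Apache log, with the log path and bot list as placeholder assumptions:

  # Tally requests and bytes per crawler user agent from a combined-format log.
  # The log path and the list of bots are assumptions for illustration.
  import re
  from collections import Counter

  requests, bytes_served = Counter(), Counter()
  with open("/var/log/nginx/access.log") as log:
      for line in log:
          quoted = re.findall(r'"([^"]*)"', line)  # request, referer, user agent
          agent = quoted[-1] if quoted else "unknown"
          for bot in ("ClaudeBot", "GPTBot", "Googlebot", "bingbot"):
              if bot in agent:
                  requests[bot] += 1
                  status_size = line.split('"')[2].split()  # status and size follow the request
                  if len(status_size) >= 2 and status_size[1].isdigit():
                      bytes_served[bot] += int(status_size[1])
  print(requests.most_common())
  print(bytes_served.most_common())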

squaresmile 6 days ago | parent [-]

Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.

dilDDoS 6 days ago | parent | prev | next [-]

As others have said, it's definitely volume, but also the failure to respect robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking whether anything has changed since the last time they crawled the site.
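
For context, HTTP already gives crawlers a cheap "has anything changed?" check via conditional requests; a minimal sketch using the third-party requests library, with the URL as a placeholder:

  # Re-fetch a page conditionally: replay the ETag / Last-Modified values from
  # the previous crawl and let the server answer 304 Not Modified if unchanged.
  # Uses the third-party "requests" library; the URL is a placeholder.
  import requests

  first = requests.get("https://example.org/page")
  conditional = {}
  if "ETag" in first.headers:
      conditional["If-None-Match"] = first.headers["ETag"]
  if "Last-Modified" in first.headers:
      conditional["If-Modified-Since"] = first.headers["Last-Modified"]

  second = requests.get("https://example.org/page", headers=conditional)
  if second.status_code == 304:
      print("unchanged since the last crawl; nothing to re-download")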

benou 6 days ago | parent [-]

Yep, AI scrapers have been breaking our open-source project's Gerrit instance hosted at the Linux Network Foundation.

Why this is the case, while web crawlers have been scraping the web for the last 30 years, is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.

johnnyanmac 6 days ago | parent [-]

>Why this is the case, while web crawlers have been scraping the web for the last 30 years, is a mystery to me.

A mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.

Philpax 7 days ago | parent | prev | next [-]

Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163

zahlman 6 days ago | parent | next [-]

Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?

NobodyNada 6 days ago | parent [-]

My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.

It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
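
Roughly, the client-side work is a hash-grinding loop; here is a sketch of the general SHA-256 proof-of-work idea, where the challenge string and difficulty are illustrative values rather than Anubis's actual parameters:

  # Sketch of an Anubis-style proof of work: grind nonces until the SHA-256
  # hash of (challenge + nonce) starts with enough zero hex digits.
  # The challenge string and difficulty here are made up for illustration.
  import hashlib
  from itertools import count

  def solve(challenge, difficulty=4):
      target = "0" * difficulty
      for nonce in count():
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
          if digest.startswith(target):
              return nonce  # the server re-hashes once to verify, which is cheap

  print(solve("challenge-tied-to-ip-and-timestamp"))

Each extra digit of difficulty multiplies the expected client work by 16, while the server's verification stays a single hash.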

Symbiote 6 days ago | parent [-]

Earlier today I found we'd served over a million requests to over 500,000 different IPs.

All had the same user agent (current Safari); they seem to be from hacked computers, as the ISPs are all over the world.

The structure of the requests almost certainly means we've been specifically targeted.

But it's also a valid query, reasonable for normal users to make.

From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.

NobodyNada 6 days ago | parent [-]

The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.

Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".
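
Spelling that scaling out (the per-CPU-hour price below is my assumption, chosen to roughly match the article-derived costs above):

  # Back-of-the-envelope version of the scaling above. The $0.02/CPU-hour
  # price is an assumed cloud rate, not a figure from the article.
  seconds_per_token = 6 * 60 / 11_508           # ~0.031 s, i.e. the ~30 ms figure
  tokens_per_site = 500_000                     # one token per rotating IP
  cpu_hours_one_site = tokens_per_site * seconds_per_token / 3600
  cpu_hours_all_sites = cpu_hours_one_site * 11_508
  price = 0.02                                  # assumed $ per CPU-hour

  print(f"{cpu_hours_one_site:.1f} CPU-h (~${cpu_hours_one_site * price:.2f}) per site, "
        f"{cpu_hours_all_sites:,.0f} CPU-h (~${cpu_hours_all_sites * price:,.0f}) for all sites")
  # ~4.3 CPU-h (~$0.09) per site, ~50,000 CPU-h (~$1,000) for all Anubis sites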

However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.

I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.

(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)

Symbiote 6 days ago | parent [-]

If the scraper scrapes from a small number of IPs they're easy to block or rate-limit. Rate-limits against this behaviour are fairly easy to implement, as are limits against non-human user agents, hence the botnet with browser user agents.

The Duke University Library analysis posted elsewhere in the discussion is promising.

I'm certain the botnets are using hacked/malwared computers, as the vast majority of requests come from ISPs and small hosting providers. It's probably more common for this to be malware, e.g. a program that streams pirated TV, or a "free" VPN app, which joins the user's device to a botnet.

immibis 7 days ago | parent | prev [-]

Why haven't they been sued and jailed for DDoS, which is a felony?

ranger_danger 6 days ago | parent | next [-]

Criminal convictions in the US require proof "beyond a reasonable doubt", and I suspect cases like this would not pass the required mens rea test: in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service. Trying to argue otherwise on technical grounds (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court, especially considering that web scraping has already been ruled legal and that a ToS clause against it cannot be legally enforced.

s1mplicissimus 6 days ago | parent | next [-]

Coming from a different legal system, so please forgive my ignorance: is it necessary in the US to prove ill intent in order to sue for damages? Just wondering, because when I accidentally punch someone's tooth out, I would assume they are certainly entitled to the dentist bill.

johnnyanmac 6 days ago | parent | next [-]

>Is it necessary in the US to prove ill intent in order to sue for damages?

As a general rule of thumb: you can sue anyone for anything in the US. There are even a few cases where someone tried to sue God: https://en.wikipedia.org/wiki/Lawsuits_against_supernatural_...

When we say "do we need" or "can we do", we're really asking how plausible it is to win the case. A lawyer won't take a case with bad odds of winning, even if you offer to pay extra, because part of their reputation rests on taking battles they feel they can win.

>because when I accidentally punch someone's tooth out, I would assume they are certainly entitled to the dentist bill.

IANAL, so the boring answer is "it depends". Damages aren't guaranteed, and there are 50 different sets of state laws to consider, on top of federal law.

Generally, the injured party isn't automatically entitled to have their damages paid, but the person who threw the punch may well be charged with battery. Intent will be a strong factor in winning the case.

Aachen 5 days ago | parent | prev [-]

Manslaughter vs. murder: same act, different intent, different stigma, different punishment.

heavyset_go 6 days ago | parent | prev | next [-]

There's an angle where criminal intent doesn't matter when it comes to negligence and damages. They had to have known that their scrapers would cause denial of service, unauthorized access, increased costs for operators, etc.

Aachen 5 days ago | parent | next [-]

That's not a certain outcome. If you're willing to take this case, I can provide access logs and any evidence you want. You can keep any money you win, plus I'll pay a bonus on top! Wanna do it?

Keep in mind I'm in Germany, the server is in another EU country, and the worst scrapers are overseas (in China, the USA, and Singapore). Thanks to these LLMs there is no barrier to having the relevant laws translated in all directions, so I trust that won't be a problem! :P

ranger_danger 6 days ago | parent | prev [-]

> criminal intent doesn't matter when it comes to negligence and damages

Are you a criminal defense attorney or prosecutor?

> They had to have known

IMO good luck convincing a judge of that... especially "beyond a reasonable doubt" as would be required for criminal negligence. They could argue lots of other scrapers operate just fine without causing problems, and that they tested theirs on other sites without issue.

slowmovintarget 6 days ago | parent | prev [-]

I thought only capital crimes (murder, for example) carried the standard of beyond a reasonable doubt. Lesser crimes require either a "Preponderance of Evidence" or "Clear and Convincing Evidence" as the burden of proof.

Still, even by those lesser standards, it's hard to build a case.

Majromax 6 days ago | parent | next [-]

It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.

Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.

slowmovintarget 6 days ago | parent [-]

Thank you.

eurleif 6 days ago | parent | prev [-]

No, all criminal convictions require proof beyond a reasonable doubt: https://constitution.congress.gov/browse/essay/amdt14-S1-5-5...

>Absent a guilty plea, the Due Process Clause requires proof beyond a reasonable doubt before a person may be convicted of a crime.

hdgvhicv 6 days ago | parent | next [-]

Proof or a guilty plea, which is often extracted from not-guilty parties due to the lopsided environment of the courts.

slowmovintarget 6 days ago | parent | prev [-]

Thank you.

Symbiote 6 days ago | parent | prev [-]

Many are using botnets, so it's not practical to find out who they are.

immibis 6 days ago | parent [-]

Then how do we know they are OpenAI?

ezrast 6 days ago | parent | prev | next [-]

High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

blibble 6 days ago | parent | prev [-]

they seem to be written by idiots and/or people that don't give a shit about being good internet citizens

either way the result is the same: they induce massive load

well-written crawlers will (a rough sketch follows the list):

  - not hit a specific ip/host more frequently than say 1 req/5s
  - put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
  - limit crawling depth based on crawled page quality and/or response time
  - respect robots.txt
  - make it easy to block them
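
A rough sketch of the first and fourth rules (per-host delay and robots.txt); the user agent string, delay value, and fetch callback are placeholders, not anyone's production crawler:

  # Sketch of a polite fetch wrapper: check robots.txt and never hit the same
  # host more often than once per 5 seconds. Names and values are illustrative.
  import time
  import urllib.robotparser
  from urllib.parse import urlparse

  MIN_DELAY = 5.0          # seconds between requests to any single host
  last_hit = {}            # host -> timestamp of the last request
  robots = {}              # host -> parsed robots.txt

  def allowed(url, agent="ExampleBot"):
      host = urlparse(url).netloc
      if host not in robots:
          rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
          rp.read()
          robots[host] = rp
      return robots[host].can_fetch(agent, url)

  def polite_fetch(url, fetch):
      if not allowed(url):
          return None                       # respect robots.txt
      host = urlparse(url).netloc
      wait = MIN_DELAY - (time.monotonic() - last_hit.get(host, 0.0))
      if wait > 0:
          time.sleep(wait)                  # enforce the 1-request-per-5s rule
      last_hit[host] = time.monotonic()
      return fetch(url)                     # fetch is whatever HTTP client you use
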
Aachen 5 days ago | parent | next [-]

- wait 2 seconds for a page to load before aborting the connection

- wait for the previous request to finish before requesting the next page, since piling on concurrent requests would only induce more load, get even slower, and eventually take everything down

I've designed my site to hold up to traffic spikes anyway, and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I mostly don't block them. I can't vet every visitor, so they'll get the content anyway, whether I like it or not. But if I see a bot racking up HTTP 499 "client went away before the page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites.

6 days ago | parent | prev [-]
[deleted]