DeepYogurt 5 hours ago

Has anyone done a talk/blog/whatever on how llm crawlers are different than classical crawlers? I'm not up on the difference.

btown 4 hours ago | parent | next [-]

IMO there was something of a de facto contract, pre-LLMs, that the set of things one would publicly mirror/excerpt/index and the set of things one would scrape were one and the same.

Back then, legitimate search engines wouldn’t want to scrape things that would just make their search results less relevant with garbage data anyways, so by and large they would honor robots.txt and not overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.

People training foundation models now have no such constraints or qualms - they need as many human-written sentences as possible, regardless of the context in which they are extracted. That’s coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. That’s an entirely different social contract, one we are still navigating.

wredcoll 2 hours ago | parent | next [-]

For all its sins, Google had a vested interest in the sites it linked to staying alive. LLMs don't.

eric-burel 6 minutes ago | parent [-]

That's a shortcut. LLM providers are very short-sighted, but not to that extreme: live websites are needed to produce new data for future training runs. Edit: damn, I've seen this movie before.

stephenitis 3 hours ago | parent | prev | next [-]

Text, images, video, all of it. I can’t think of any form of data they don’t want to scoop up, other than noise and poisoned data.

cwbriscoe 2 hours ago | parent | prev [-]

I am not well versed in this problem but can't the web servers rate limit by known IP addresses of these crawler/scrapers?
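For illustration, here is a minimal sketch of what per-IP rate limiting could look like, using a token bucket per client address. The class name `PerIPRateLimiter` and its parameters are hypothetical and not taken from any particular web server; a real deployment would more likely use the server's or CDN's built-in limiter.

```python
import time


class PerIPRateLimiter:
    """Token bucket per client IP (hypothetical helper, for illustration only)."""

    def __init__(self, rate_per_sec=1.0, burst=5, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a bucket can hold
        self.clock = clock         # injectable clock, handy for testing
        self.buckets = {}          # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if this request from `ip` is within its budget."""
        now = self.clock()
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

Note that every previously unseen address starts with a full bucket, so a limiter like this only helps when the crawler reuses a small set of addresses.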

Yoric 7 minutes ago | parent | next [-]

Not the exact same problem, but a few months ago, I tried to block YouTube traffic from my home by IP (I was writing a parental app for my child). After a few hours of trying to collect IPs, I gave up, realizing that YouTube was dynamically load-balanced across millions of IPs, some of which also served traffic from other Google services I didn't want to block.

I wouldn't be surprised if it was the same with LLMs. Millions of workers allocated dynamically on AWS, with varying IPs.

In my specific case, as I was dealing with browser-initiated traffic, I wrote a Firefox add-on instead. No such shortcut for web servers, though.

strogonoff an hour ago | parent | prev | next [-]

You cannot block LLM crawlers by IP address, because some of them use residential proxies. Source: 1) a friend admins a slightly popular site and has decent bot detection heuristics, 2) just Google “residential proxy LLM”, they are not exactly hiding. Strip-mining original intellectual property for commercial usage is big business.

skrebbel an hour ago | parent | next [-]

How does this work? Why would people let randos use their home internet connections? I googled it but the companies selling these services are not exactly forthcoming on how they obtained their "millions of residential IP addresses".

Are these botnets? Are AI companies mass-funding criminal malware companies?

fakwandi_priv 11 minutes ago | parent | next [-]

It used to be Hola VPN, which would let you use someone else’s connection and, in the same way, let someone else use yours. That was communicated transparently; that same Hola client would also route traffic for business users. I'm sure many other free VPN clients do the same thing nowadays.

joha4270 35 minutes ago | parent | prev | next [-]

I have seen it claimed that it's a way of monetizing free phone apps. Just bundle a proxy and get paid for that.

cuu508 16 minutes ago | parent [-]

A recent HN thread about this: https://news.ycombinator.com/item?id=45746156

stackghost 44 minutes ago | parent | prev [-]

>Are these botnets? Are AI companies mass-funding criminal malware companies?

Without a doubt some of them are botnets. AI companies got their initial foothold by violating copyright en masse with pirated textbook dumps for training data, and whatnot. Why should they suddenly develop scruples now?

globalnode 37 minutes ago | parent | prev [-]

So the user either has a malware proxy running requests without noticing, or voluntarily signed up as a proxy to make extra $ off their home connection. Either way, I don't care if their IP is blocked. The only problem is that if users behind CGNAT get their IP blocked, legitimate users sharing that address may later be blocked too.
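One way to soften the CGNAT problem would be to make blocks expire, so a shared address is not punished forever. A minimal sketch, assuming an in-memory store and a one-hour default (both are my assumptions, and `ExpiringBlocklist` is a hypothetical name):

```python
import time


class ExpiringBlocklist:
    """IP blocklist whose entries lapse after a TTL (illustrative sketch)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable clock, handy for testing
        self.expires = {}       # ip -> time at which the block lapses

    def block(self, ip):
        """Block `ip` from now until the TTL has elapsed."""
        self.expires[ip] = self.clock() + self.ttl

    def is_blocked(self, ip):
        """Return True while the block on `ip` is still in force."""
        expiry = self.expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # The block has lapsed; forget it so the address gets a fresh start.
            del self.expires[ip]
            return False
        return True
```

The TTL is a trade-off: shorter values hurt fewer legitimate users behind a shared address, but let a misbehaving client back in sooner.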

ninja3925 an hour ago | parent | prev [-]

Large cloud providers could offer that solution, but then crawlers can also cycle IPs.

phantomathkg 4 hours ago | parent | prev | next [-]

Yes

https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

https://www.usebox.net/jjm/blog/the-problem-of-the-llm-crawl...

klodolph 5 hours ago | parent | prev | next [-]

The only real difference is that LLM crawlers tend not to respect /robots.txt, and some of them hammer sites with some pretty heavy traffic.

The trap in the article has a link. Bots are instructed not to follow the link. The link is normally invisible to humans. A client that visits the link is probably therefore a poorly behaved bot.
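A rough sketch of that idea, not the article's actual implementation: the `/trap/` path, the hidden-link markup, and the `TrapDetector` class are all assumptions made for illustration. It uses the standard library's `urllib.robotparser` to show that a bot honoring robots.txt would never request the trap URL, so any client that does is flagged.

```python
from urllib.robotparser import RobotFileParser

TRAP_PATH = "/trap/"  # hypothetical trap URL, not from the article

# robots.txt tells every crawler to stay out of the trap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

# Link that is present in the HTML but hidden from human visitors.
TRAP_LINK_HTML = '<a href="/trap/" style="display:none" rel="nofollow">archive</a>'


def polite_bot_may_fetch(path, user_agent="ExampleBot"):
    """What a crawler that honors robots.txt would conclude about `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


class TrapDetector:
    """Flags any client that requests the trap URL (illustrative sketch)."""

    def __init__(self):
        self.flagged = set()

    def observe(self, client_id, path):
        """Record a request; return True if the client is (now) flagged."""
        if path.startswith(TRAP_PATH):
            self.flagged.add(client_id)
        return client_id in self.flagged
```

What `client_id` should be is the hard part given the rest of this thread: an IP address is the obvious choice, but it is a weak identifier when traffic comes through rotating residential proxies.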

superkuh 4 hours ago | parent | prev [-]

Recently there have been more crawlers coming from tens to hundreds of IP netblocks across dozens (or more!) of ASNs in a highly time- and URL-correlated fashion, with spoofed user-agent(s) and no regard for rate or request limiting or robots.txt. These attempt to visit every possible permutation of URLs on the domain and have a lot of bandwidth and established TCP connections available to them. It's not that this didn't happen pre-2023, but it's noticeably more common now. If you have a public webserver you've probably experienced it at least once.

Actual LLM involvement as the requesting user-agent is vanishingly small. It's the same problem as ever: corporations, their profit motive during $hypecycle coupled with access to capital for IT resources, and the protection of the abusers via the company's abstraction away of legal liability for their behavior.