| ▲ | simonw 5 days ago |
| There are two common misconceptions in this post. The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains, there's little point trying to convince them that companies sometimes actually do what they say they are doing. The bigger misconception, though, is the idea that LLM training involves indiscriminately hoovering up every scrap of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era. Building a great LLM is entirely about building a high-quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data. |
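To make that concrete, here is a toy sketch of the kind of heuristic filter I mean. The tiny word list and the threshold are invented for illustration; real pipelines combine many more signals, learned quality classifiers, and deduplication:

    # Toy quality filter: reject documents whose ratio of
    # non-dictionary tokens is too high. Illustrative only:
    # the word list and threshold are placeholders.
    KNOWN_WORDS = {"the", "a", "and", "of", "to", "is", "in", "it",
                   "that", "for", "on", "with", "as", "this", "text"}

    def junk_ratio(text: str) -> float:
        tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
        tokens = [t for t in tokens if t.isalpha()]
        if not tokens:
            return 1.0  # empty or all-symbol documents count as junk
        unknown = sum(1 for t in tokens if t not in KNOWN_WORDS)
        return unknown / len(tokens)

    def keep_document(text: str, max_junk: float = 0.6) -> bool:
        return junk_ratio(text) <= max_junk

    print(keep_document("This is the kind of text a filter might keep."))  # True
    print(keep_document("xq zzv gkpl wqx jjr mmnt vvb"))                   # False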
|
| ▲ | vintermann 5 days ago | parent | next [-] |
| There are definitely scrapers that ignore your robots.txt file. Whether they're some "Enemy State" LLM outfit, an "Allied State" corporation outsourcing their dirty work a step or two, or just some data hoarder worried that the web as we know it is going away soon, everyone is saying they're a problem lately, and I don't think everyone is lying. But it's certainly also true that anyone feeding the scrapings to an LLM will filter them first. It's very naive of this author to think that his ad-lib-spun prose won't get detected and filtered out long before it's used for training. Even the pre-LLM internet had endless pages of this sort of thing, from aspiring SEO spammers. Yes, you're wasting a bit of the scraper's resources, but you can bet they're already calculating in that waste. |
| |
| ▲ | simonw 5 days ago | parent [-] | | > There are definitely scrapers that ignore your robots.txt file Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt". The exact quote from the article that I'm pushing back on here is: "If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality." Which appears directly below this:

    User-agent: GPTBot
    Disallow: /
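For what it's worth, that same OpenAI page documents three separate user agents, so a robots.txt intended to cover all of them would look more like the following (the agent names are taken directly from that page):

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /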
| | |
| ▲ | simoncion 5 days ago | parent | next [-] | | > But those aren't the ones that explicitly say "here is how to block us in robots.txt" Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it. Regardless of whether you agree with my expectations, I hope you'll understand why I expect many, if not most, companies in this section of the industry to publicly assert that they're behaving ethically while doing all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent. [0] because of the expected total cost of licensing fees [1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons | | |
| ▲ | simonw 5 days ago | parent [-] | | I agree that AI companies do all sorts of shady stuff to accumulate training data. See Anthropic's recent lawsuit, which I covered here: https://simonwillison.net/2025/Jun/24/anthropic-training/ That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't. Saying "we will obey your robots.txt file" and lying about it is a different category of shady. I care about that difference. | | |
| ▲ | simoncion 2 days ago | parent | next [-] | | > That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't. Ah, good. So you have solid evidence that they're NOT doing shady stuff. Great! Let's have it. "It's unfair to require me to prove a negative!" you say? Sure, that's a fair objection... but my counter to that is "We'll only get solid evidence of dirty dealings if an insider turns stool pigeon". So, given that we're certainly not going to get solid evidence, we must base our evaluation on the behavior of the companies in other big projects. Over the past few decades, Google, Facebook, and Microsoft have not demonstrated that they're dedicated to behaving ethically. (And their behavior has gotten far, far worse over the past few years.) OpenAI's CEO is plainly and obviously a manipulator and savvy political operator. (Remember how he once declared that it was vitally important that he could be fired?) Anthropic's CEO just keeps lying to the press [0] in order to keep fueling AGI hype. [0] Oh, pardon me. He's "making a large volume of forward-looking statements that (due to ever-evolving market conditions) turn out to be inaccurate". I often get that concept confused with "lying". My bad. | | |
| ▲ | simonw a day ago | parent [-] | | So call them out for the bad stuff! Don't distract from the genuine problems by making up stuff about them ignoring robots.txt directives despite their documentation clearly explaining how those are handled. | | |
| ▲ | simoncion a day ago | parent [-] | | > So call them out for the bad stuff! I am. I am also saying (because the companies involved have demonstrated that they're either frequently willing to do things that are scummy as shit or "just" have executives that make a habit of lying to the press in order to keep the hype train rolling) that it's very, very likely that they're quietly engaging in antisocial behavior in order to make development of their projects some combination of easier, quicker, or cheaper. > Don't distract from the genuine problems by making up stuff... Right back at you. You said: > There are definitely scrapers that ignore your robots.txt file
> Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"
| But you don't have any proof of that. This is pure speculation on your part. Given the frequency of and degree to which the major companies involved in this ongoing research project engage in antisocial behavior [0], it's more likely than not that they are doing shady shit. As I mentioned, there's a ton of theoretical money on the line. The unfortunate thing for us is that neither of us can do anything other than speculate... unless an insider turns informant. [0] ...and given how the expected penalties for engaging in most of the antisocial behavior that's relevant to the AI research project are somewhere between "absolutely nothing at all" and "maybe six to twelve months of expected revenue"... |
|
| |
| ▲ | cyphar 5 days ago | parent | prev [-] | | Maybe I'm the outlier here, but I think intentionally torrenting millions of books and taking great pains to try to avoid linking the activity to your company is far beyond something as "trivial" as ignoring robots.txt. This is like wringing your hands over whether a serial killer also jaywalked on their way to the crime scene. (In theory the former is supposed to be a capital-C criminal offence: felony copyright infringement.) |
|
| |
| ▲ | spacebuffer 5 days ago | parent | prev [-] | | Your initial comment made sense after reading through the OpenAI doc page, so I opened up my site to add those to robots.txt; it turns out I had already added all 3 of those user agents to my robots file [0]. Out of curiosity I asked ChatGPT about my site and it did scrape it; it even mentioned articles that were published after I added the robots file. [0]: https://yusuf.fyi/robots.txt | | |
| ▲ | simonw 4 days ago | parent [-] | | Can you share that ChatGPT transcript? One guess: ChatGPT uses Bing for search queries, and your robots.txt doesn't block Bing. If that's what is happening here, I agree that this is really confusing and should be clarified by the OpenAI bots page. | | |
| ▲ | spacebuffer 4 days ago | parent [-] | | Here you go: https://chatgpt.com/share/68bc125a-9e9c-8005-9c9f-298dbd541d... | | |
| ▲ | simonw 4 days ago | parent [-] | | Yeah, wow, that's a lot of information for a site that's supposedly blocked using robots.txt! My best guess is that this is a Bing thing: ChatGPT uses Bing as their search partner (though they don't make that very obvious at all), and BingBot isn't blocked by your site. I think OpenAI need to be a whole lot more transparent about this. It's very misleading to block their crawlers and have it not make any difference at all to the search results returned within ChatGPT. |
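If that Bing guess is right, a site that also wants out of those search-backed answers would have to block Bing's own crawler as well, at the cost of disappearing from Bing search entirely. A sketch (bingbot is the user agent Bing documents for its crawler):

    User-agent: bingbot
    Disallow: /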
|
|
|
|
|
|
| ▲ | Retric 5 days ago | parent | prev | next [-] |
| > The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: That’s testable, and you can find content “protected” by robots.txt regurgitated by LLMs. In practice it doesn’t matter whether that’s through companies lying or some 3rd party scraping your content and then getting scraped. |
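For example (a rough sketch; the token format here is arbitrary): publish a unique canary string only on a page your robots.txt disallows, wait, then prompt the models for it. Any model that reproduces it got the page somehow, directly or via a third-party rescrape.

    import secrets

    # Generate a unique canary token to publish only on a
    # robots.txt-disallowed page. If a model can later reproduce it,
    # that page was ingested despite the disallow, either directly
    # or through a third party whose copy got scraped.
    token = f"robots-canary-{secrets.token_hex(16)}"
    print(token)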
| |
| ▲ | simonw 5 days ago | parent | next [-] | | There's a subtle but important difference between crawling data to train a model and accessing data as part of responding to a prompt and then piping that content into the context in order to summarize it (which may be what you mean by "regurgitation" here; I'm not sure). I think that distinction is lost on a lot of people, which is understandable. | |
| ▲ | simonw 5 days ago | parent | prev [-] | | Do you have an example that demonstrates that? | | |
| ▲ | whilenot-dev 5 days ago | parent [-] | | User agent "Perplexity-User" [0]: > Since a user requested the fetch, this fetcher generally ignores robots.txt rules. [0]: https://docs.perplexity.ai/guides/bots | | |
| ▲ | Lerc 5 days ago | parent | next [-] | | There's definitely a distinction between fetching data for training and fetching data as an agent on behalf of a user. I guess you could demand that any program that identifies itself as a user agent should be blocked, but it seems counterproductive. | | |
| ▲ | theamk 5 days ago | parent [-] | | Counterproductive for what? If I am writing for entertainment value, I see no problem with blocking all AI agents; the goal of the text is to be read by humans, after all. For technical texts, one might want to block AI agents as well: they often omit critical parts and hallucinate. If you want your "DON'T DO THIS" sections to be read, better block them. |
| |
| ▲ | nine_k 5 days ago | parent | prev [-] | | But this is more like `curl https://some/url/...` ignoring robots.txt. Crawlers are the thing that should honor robots.txt, "nofollow", etc. | | |
| ▲ | whilenot-dev 5 days ago | parent [-] | | The title of this site is "Perplexity Crawlers"... And it clearly contradicts simonw's stance: > it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it | | |
| ▲ | simonw 5 days ago | parent [-] | | Where did Perplexity say they would obey robots.txt? They explicitly document that they do not obey robots.txt for that form of crawling (user-triggered; the data is not gathered for training). Their documentation is very clear: https://docs.perplexity.ai/guides/bots | | |
| ▲ | whilenot-dev 5 days ago | parent [-] | | I thought your emphasis was on the "it's the idea that LLM vendors ignore your robots.txt file" part of your statement... Now your point is it's okay for them to ignore it because they announce that their crawler "ignores robots.txt rules"? | | |
| ▲ | simonw 5 days ago | parent [-] | | No, my argument here is that it's not OK to say "but OpenAI are obviously lying about following robots.txt for their training crawler" when their documentation says they obey robots.txt. There's plenty to criticize AI companies for. I think it's better to stick to things that are true. | | |
| ▲ | latexr 5 days ago | parent | next [-] | | > when their documentation says Their documentation could say Sam Altman is the queen of England. It wouldn’t make it true. OpenAI has been repeatedly caught lying about respecting robots.txt. https://www.businessinsider.com/openai-anthropic-ai-ignore-r... https://web.archive.org/web/20250802052421/https://mailman.n... https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope... | | |
| ▲ | simonw 5 days ago | parent [-] | | Those three links don't support your argument here. The Business Insider one is a paywalled rehash of this Reuters story: https://www.reuters.com/technology/artificial-intelligence/m... That story was itself a report based on some data-driven PR by a startup, TollBit, who sell anti-scraping technology. Here's that report: https://tollbit.com/bots/24q4/ I downloaded a copy and found it actually says "OpenAI respects the signals provided by content owners via robots.txt allowing them to disallow any or all of its crawlers". I don't know where the idea that TollBit say OpenAI don't obey robots.txt comes from. The second one is someone saying that their site, which didn't use robots.txt, was aggressively crawled. The third one claims to prove OpenAI are ignoring robots.txt, but it shows request logs for the user agent ChatGPT-User, which is NOT the same thing as GPTBot, as documented on https://platform.openai.com/docs/bots
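The GPTBot/ChatGPT-User distinction is also easy to check in your own logs. A minimal sketch, assuming a plain-text access log at a hypothetical path with the user agent somewhere on each line:

    from collections import Counter

    # OpenAI's documented user agents serve different purposes:
    # GPTBot crawls for training, ChatGPT-User fetches pages on
    # behalf of a user's prompt, OAI-SearchBot powers search.
    AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")

    counts = Counter()
    with open("access.log") as f:  # hypothetical log path
        for line in f:
            for agent in AGENTS:
                if agent in line:
                    counts[agent] += 1

    for agent, n in counts.most_common():
        print(f"{agent}: {n} requests")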
| |
| ▲ | whilenot-dev 5 days ago | parent | prev [-] | | Agree, and in this same vein TFA was talking about LLMs in general, not OpenAI specifically. While I get your concern and would also like to avoid any sensationalism, there is still this itch about the careful wording in all these company statements. For example, who's the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat with its urge for automation, it could very well be that Perplexity maintains a list of all the sites blocked for the PerplexityBot user agent through robots.txt rules. Such a list would help for crawling optimization, but could also be used later to have an employee ask Perplexity a certain question that would re-crawl the site with the Perplexity-User user agent anyway (the one that ignores robots.txt rules). Call it the work of the QA department. Unless we worked for such a company in a high position we'd never really know, and the existing violations of trust (just in regard to copyrighted works alone!) are reason enough to keep a certain mistrust by default when it comes to young companies that are already valued in the billions and their handling of their most basic resources. |
|
|
|
|
|
|
|
|
|
| ▲ | rozab 5 days ago | parent | prev | next [-] |
| After I set up a self-hosted git forge a little while ago, I found that within minutes it got hammered by OpenAI, Anthropic, etc. They were extremely aggressive, grabbing every individual file from every individual commit, one at a time. I hadn't backlinked the site anywhere and was just testing, so I hadn't thought to put up a robots.txt. They must have found me through my cert registration. After I put up my robots.txt (with explicit UA blocks instead of wildcards; I'd heard some crawlers ignore wildcard rules), I found that after a day or so the scraping stopped completely. The only ones I get now are vulnerability scanners, or random spiders taking just the homepage. I know my site is of no consequence, but for those claiming OpenAI et al. ignore robots.txt I would really like to see some evidence. They are evil and disrespectful and I'm gutted they stole my code for profit, but I'm still sceptical of these claims. Cloudflare have done lots of work here and have never mentioned crawlers ignoring robots.txt: https://blog.cloudflare.com/control-content-use-for-ai-train... |
|
| ▲ | CrossVR 5 days ago | parent | prev | next [-] |
| > Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains, there's little point trying to convince them that companies sometimes actually do what they say they are doing. Even if the large LLM vendors respect it, there's enough venture capital going around that plenty of smaller vendors are attempting to train their own LLMs, and they'll take every edge they can get, robots.txt be damned. |
| |
|
| ▲ | flir 5 days ago | parent | prev | next [-] |
| > The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots So, uh... where's all the extra traffic coming from? |
| |
| ▲ | simonw 5 days ago | parent [-] | | All of the badly behaved crawlers. | | |
| ▲ | flir 5 days ago | parent [-] | | Yeah, I read the rest of the conversation and tried to delete. I understand your point now. Apologies. |
|
|
|
| ▲ | hooloovoo_zoo 5 days ago | parent | prev | next [-] |
| Your link does not say they will obey it. |
| |
| ▲ | simonw 5 days ago | parent [-] | | Direct quote from https://platform.openai.com/docs/bots: "OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI." Then for GPTBot it says: "GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models." What are you seeing here that I'm missing? | | |
| ▲ | hooloovoo_zoo 5 days ago | parent [-] | | My read is that they are describing functionality for site owners to provide input about what the site owner thinks should happen. OpenAI is not promising that is what WILL happen, even in the narrow context of that specific bot. |
|
|
|
| ▲ | charles_f 5 days ago | parent | prev [-] |
| I somewhat agree with your viewpoint on copyright, but what terrifies me is VCs like a16z or Sequoia simultaneously backing up large LLMs profiting from ignoring copyright and media firms where they'll use whatever power and lobby they have to protect copyright. I don't think the content I produce is worth that much, I'm glad if it can serve anyone, but I find amusing the idea to poison the well |