simonw 5 days ago

> There are definitely scrapers that ignore your robots.txt file

Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"

The exact quote from the article that I'm pushing back on here is:

"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality."

Which appears directly below this:

  User-agent: GPTBot
  Disallow: /
simoncion 5 days ago | parent | next

> But those aren't the ones that explicitly say "here is how to block us in robots.txt"

Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.

Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.

[0] because of the expected total cost of licensing fees

[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons

simonw 5 days ago | parent

I agree that AI companies do all sorts of shady stuff to accumulate training data. See Anthropic's recent lawsuit which I covered here: https://simonwillison.net/2025/Jun/24/anthropic-training/

That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't. Saying "we will obey your robots.txt file" and lying about it is a different category of shady. I care about that difference.

simoncion 2 days ago | parent | next

> That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't.

Ah, good. So you have solid evidence that they're NOT doing shady stuff. Great! Let's have it.

"It's unfair to require me to prove a negative!" you say? Sure, that's a fair objection... but my counter to that is "We'll only get solid evidence of dirty dealings if an insider turns stool pidgeon.". So, given that we're certainly not going to get solid evidence, we must base our evaluation on the behavior of the companies in other big projects.

Over the past few decades, Google, Facebook, and Microsoft have not demonstrated that they're dedicated to behaving ethically. (And their behavior has gotten far, far worse over the past few years.) OpenAI's CEO is plainly and obviously a manipulator and savvy political operator. (Remember how he once declared that it was vitally important that he could be fired?) Anthropic's CEO just keeps lying to the press [0] in order to keep fueling AGI hype.

[0] Oh, pardon me. He's "making a large volume of forward-looking statements that, due to ever-evolving market conditions, turn out to be inaccurate". I often get that concept confused with "lying". My bad.

simonw a day ago | parent

So call them out for the bad stuff! Don't distract from the genuine problems by making up stuff about them ignoring robots.txt directives despite their documentation clearly explaining how those are handled.

simoncion a day ago | parent

> So call them out for the bad stuff!

I am. I am also saying that, because the companies involved have demonstrated that they're either frequently willing to do things that are scummy as shit or "just" have executives who make a habit of lying to the press in order to keep the hype train rolling, it's very, very likely that they're quietly engaging in antisocial behavior in order to make development of their projects some combination of easier, quicker, or cheaper.

> Don't distract from the genuine problems by making up stuff...

Right back at you. You said:

  > There are definitely scrapers that ignore your robots.txt file
  Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"

But you don't have any proof of that. This is pure speculation on your part. Given the frequency of and degree to which the major companies involved in this ongoing research project engage in antisocial behavior [0], it's more likely than not that they are doing shady shit. As I mentioned, there's a ton of theoretical money on the line.

The unfortunate thing for us is that neither of us can do anything other than speculate... unless an insider turns informant.

[0] ...and given how the expected penalties for engaging in most of the antisocial behavior that's relevant to the AI research project are somewhere between "absolutely nothing at all" and "maybe six to twelve months of expected revenue"...

cyphar 5 days ago | parent | prev

Maybe I'm the outlier here, but I think intentionally torrenting millions of books and taking great pains to try to avoid linking the activity to your company is far beyond something as "trivial" as ignoring robots.txt. This is like wringing your hands over whether a serial killer also jaywalked on their way to the crime scene.

(In theory the former is supposed to be a capital-C criminal offence -- felony copyright infringement.)

spacebuffer 5 days ago | parent | prev

Your initial comment made sense after reading through the OpenAI docs page, so I opened up my site to add those to robots.txt. It turns out I had already added all 3 of those user-agents to my robots file [0]. Out of curiosity I asked ChatGPT about my site, and it did scrape it; it even mentioned articles that were published after I added the robots file.

[0]: https://yusuf.fyi/robots.txt

simonw 4 days ago | parent

Can you share that ChatGPT transcript?

One guess: ChatGPT uses Bing for search queries, and your robots.txt doesn't block Bing. If that's what is happening here, I agree that this is really confusing and should be clarified by the OpenAI bots page.
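
If that is the cause, covering both the documented OpenAI agents and Bing's crawler would mean something like this in robots.txt (just a sketch: the three OpenAI agent names are the ones their bots page lists, and Bingbot is Bing's usual crawler token, but double-check both against the current docs):

  User-agent: GPTBot
  User-agent: ChatGPT-User
  User-agent: OAI-SearchBot
  User-agent: Bingbot
  Disallow: /

The obvious trade-off is that blocking Bingbot also drops the site out of regular Bing search, which is a much bigger cost than just opting out of ChatGPT.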

spacebuffer 4 days ago | parent

Here you go: https://chatgpt.com/share/68bc125a-9e9c-8005-9c9f-298dbd541d...

simonw 4 days ago | parent

Yeah, wow that's a lot of information for a site that's supposedly blocked using robots.txt!

My best guess is that this is a Bing thing: ChatGPT uses Bing as their search partner (though they don't make that very obvious at all), and Bingbot isn't blocked by your site.

I think OpenAI need to be a whole lot more transparent about this. It's very misleading to block their crawlers and have it not make any difference at all to the search results returned within ChatGPT.
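
For anyone who wants to verify what their robots.txt actually says about these agents, Python's built-in robotparser gives a quick answer. A rough sketch, using the yusuf.fyi URL from the comment above as the example:

  from urllib.robotparser import RobotFileParser

  # Fetch and parse the live robots.txt file
  rp = RobotFileParser("https://yusuf.fyi/robots.txt")
  rp.read()

  # OpenAI's documented crawlers plus Bing's crawler
  for agent in ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bingbot"]:
      print(agent, rp.can_fetch(agent, "https://yusuf.fyi/"))

If the OpenAI agents come back False while Bingbot comes back True, that matches my guess about what's happening here.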