simoncion | 5 days ago
> But those aren't the ones that explicitly say "here is how to block us in robots.txt"

Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.

Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.

[0] because of the expected total cost of licensing fees

[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons
simonw | 5 days ago | parent
I agree that AI companies do all sorts of shady stuff to accumulate training data. See the recent Anthropic lawsuit, which I covered here: https://simonwillison.net/2025/Jun/24/anthropic-training/

That's why I care so much about differentiating between the shady stuff that they DO and the shady stuff that they don't. Saying "we will obey your robots.txt file" and then lying about it is a different category of shady. I care about that difference.
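(For concreteness, here is a minimal Python sketch of what "obeying robots.txt" means mechanically: a well-behaved crawler checks the site's robots.txt for its own user-agent token before fetching a page. The crawler token and URLs below are hypothetical placeholders, not any vendor's actual names.)

    # Minimal sketch of a robots.txt check, using only the Python standard
    # library. "ExampleAIBot" and the example.com URLs are hypothetical.
    from urllib import robotparser

    CRAWLER_USER_AGENT = "ExampleAIBot"  # hypothetical crawler token

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some-article"
    if rp.can_fetch(CRAWLER_USER_AGENT, url):
        print("allowed by robots.txt, OK to fetch:", url)
    else:
        print("disallowed by robots.txt, skipping:", url)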