simoncion | 5 days ago
> But those aren't the ones that explicitly say "here is how to block us in robots.txt"

Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.

Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.

[0] because of the expected total cost of licensing fees

[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons
simonw | 5 days ago | parent
I agree that AI companies do all sorts of shady stuff to accumulate training data. See the recent Anthropic lawsuit, which I covered here: https://simonwillison.net/2025/Jun/24/anthropic-training/

That's why I care so much about differentiating between the shady stuff that they DO and the shady stuff that they don't. Saying "we will obey your robots.txt file" and then lying about it is a different category of shady. I care about that difference.
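(For concreteness, here is a minimal Python sketch of what "obeying robots.txt" means mechanically: a well-behaved crawler checks the site's robots.txt for its own user-agent token before fetching a page. The crawler token and URLs below are hypothetical placeholders, not any vendor's actual names.)

    # Minimal sketch of a robots.txt check, using only the Python standard
    # library. "ExampleAIBot" and the example.com URLs are hypothetical.
    from urllib import robotparser

    CRAWLER_USER_AGENT = "ExampleAIBot"  # hypothetical crawler token

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some-article"
    if rp.can_fetch(CRAWLER_USER_AGENT, url):
        print("allowed by robots.txt, OK to fetch:", url)
    else:
        print("disallowed by robots.txt, skipping:", url)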