simonw 5 days ago
> There are definitively scrapers that ignore your robots.txt file

Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"

The exact quote from the article that I'm pushing back on here is:

"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality."

Which appears directly below this:
simoncion 5 days ago
> But those aren't the ones that explicitly say "here is how to block us in robots.txt"

Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.

Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.

[0] because of the expected total cost of licensing fees

[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons
spacebuffer 5 days ago
Your initial comment made sense after reading through the OpenAI doc page, so I opened up my site to add those to robots.txt. Turns out I had already added all 3 of those user-agents to my robots file [0]. Out of curiosity I asked ChatGPT about my site, and it did scrape it. It even mentioned articles that were published after I added the robots file.
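
For anyone wanting to double-check their own robots file, here is a minimal sketch using Python's standard `urllib.robotparser`. The three agent names (`GPTBot`, `ChatGPT-User`, `OAI-SearchBot`) and the sample file contents are assumptions for illustration, not taken from the site mentioned above, and `blocked_agents` is a hypothetical helper. It only reports what the file asks for; it cannot tell you whether a crawler actually honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The three agent names are assumed to be
# the ones the OpenAI doc page lists; check the page for the current names.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Map each agent name to True if the rules disallow it from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Googlebot"]
    for agent, blocked in blocked_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

If the file checks out and the content still shows up, the rules were either ignored or the pages were fetched by an agent the file does not name; server access logs are the place to look for which.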