simonw 5 days ago
No, my argument here is that it's not OK to say "but OpenAI are obviously lying about following robots.txt for their training crawler" when their documentation says they obey robots.txt. There's plenty to criticize AI companies for. I think it's better to stick to things that are true.
latexr 5 days ago | parent
> when their documentation says

Their documentation could say Sam Altman is the queen of England. It wouldn't make it true. OpenAI has been repeatedly caught lying about respecting robots.txt:

https://www.businessinsider.com/openai-anthropic-ai-ignore-r...

https://web.archive.org/web/20250802052421/https://mailman.n...

https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope...
whilenot-dev 5 days ago | parent
Agree, and in the same vein, TFA was talking about LLMs in general, not OpenAI specifically.

While I get your concern and would also like to avoid sensationalism, there is still this itch about the careful wording in all these company statements. For example, who is the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat, with its urge for automation: Perplexity could very well maintain a list of all the sites that block the PerplexityBot user agent through robots.txt rules. Such a list would help with crawl optimization, but it could also later be used to have an employee ask Perplexity a question that re-crawls those sites with the Perplexity-User agent anyway (the one that ignores robots.txt rules). Call it the work of the QA department.

Unless we worked for such a company in a senior position we would never really know, and the existing violations of trust - with regard to copyrighted works alone(!) - are reason enough to keep a certain mistrust by default toward young companies that are already valued in the billions, and toward how they handle their most basic resources.
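To make the mechanism concrete: robots.txt rules are grouped by user agent, so the very same file can block one agent while permitting another. A minimal sketch with Python's standard-library `urllib.robotparser` (the robots.txt content and URL here are made up for illustration; this says nothing about what Perplexity actually does):

```python
# Hypothetical illustration: one robots.txt, two different answers
# depending on which user agent asks. Uses only the standard library.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/article"
# The named crawler group applies to PerplexityBot and blocks it;
# Perplexity-User matches no named group and falls through to "*".
print(parser.can_fetch("PerplexityBot", url))    # blocked
print(parser.can_fetch("Perplexity-User", url))  # allowed
```

A crawler that logs which sites return "blocked" for its bot agent would have exactly the kind of list described above, ready to be re-fetched under the other agent.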