simonw 5 days ago

No, my argument here is that it's not OK to say "but OpenAI are obviously lying about following robots.txt for their training crawler" when their documentation says they obey robots.txt.

There's plenty to criticize AI companies for. I think it's better to stick to things that are true.

latexr 5 days ago

> when their documentation says

Their documentation could say Sam Altman is the queen of England. It wouldn’t make it true. OpenAI has been repeatedly caught lying about respecting robots.txt.

https://www.businessinsider.com/openai-anthropic-ai-ignore-r...

https://web.archive.org/web/20250802052421/https://mailman.n...

https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope...

simonw 5 days ago

Those three links don't support your argument here.

The Business Insider one is a paywalled rehash of this Reuters story https://www.reuters.com/technology/artificial-intelligence/m... - which was itself a report based on some data-driven PR by a startup, TollBit, who sell anti-scraping technology. Here's that report: https://tollbit.com/bots/24q4/

I downloaded a copy and found it actually says "OpenAI respects the signals provided by content owners via robots.txt allowing them to disallow any or all of its crawlers". I don't know where the idea that TollBit say OpenAI don't obey robots.txt comes from.

The second one is someone saying that their site which didn't use robots.txt was aggressively crawled.

The third one claims to prove OpenAI are ignoring robots.txt but shows request logs for user-agent ChatGPT-User which is NOT the same thing as GPTBot, as documented on https://platform.openai.com/docs/bots
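
To make the user-agent distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration. It only shows how a compliant parser evaluates the rules; it says nothing about what any real crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the training crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant client with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User"):
        print(agent, allowed(agent, url))
    # A site with no rules at all (the second link's situation) blocks nothing.
    print("GPTBot, empty robots.txt", allowed("GPTBot", url, ""))
```

A rule group naming GPTBot doesn't cover ChatGPT-User, so request logs showing ChatGPT-User on a site that only disallows GPTBot aren't evidence of a robots.txt violation by the training crawler.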

whilenot-dev 5 days ago

Agree, and in this same vein TFA was talking about LLMs in general, not OpenAI specifically. While I get your concern and would also like to avoid any sensationalism, there is still this itch about the careful wording in all these company statements.

For example, who's the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat with its urge for automation, it could very well be that Perplexity maintains a list of all the sites blocked for the PerplexityBot user agent through robots.txt rules. Such a list would help with crawl optimization, but it could also be used to later have any employee ask Perplexity a certain question that would re-crawl the site with the Perplexity-User user agent anyway (the one ignoring robots.txt rules). Call it the work of the QA department.

Unless we worked for such a company in a high position we'd never really know, and the existing violations of trust - in regard to copyrighted works alone(!) - are reason enough to keep a certain mistrust by default when it comes to young companies that are already valued in the billions and the handling of their most basic resources.