Retric 5 days ago

> The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it:

That’s testable, and you can find content “protected” by robots.txt regurgitated by LLMs. In practice it doesn’t matter whether that happens through companies lying or through a third party scraping your content and that copy then getting scraped.

simonw 5 days ago | parent | next [-]

There's a subtle but important difference between crawling data to train a model and fetching data as part of responding to a prompt, then piping that content into the context in order to summarize it (which may be what you mean by "regurgitation" here, I'm not sure).

I think that distinction is lost on a lot of people, which is understandable.
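Roughly, the user-triggered path looks something like this (a sketch only; `summarize_with_llm` stands in for whatever model call the vendor actually makes, it's not a real API):

    import requests

    def answer_about_url(question: str, url: str) -> str:
        # Fetched at prompt time because a user asked about this URL,
        # not as part of building a training corpus.
        page = requests.get(url, timeout=10)
        context = page.text[:20000]  # trim to fit the context window
        prompt = f"Using this page:\n{context}\n\nAnswer: {question}"
        return summarize_with_llm(prompt)  # hypothetical model call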

simonw 5 days ago | parent | prev [-]

Do you have an example that demonstrates that?

whilenot-dev 5 days ago | parent [-]

User Agent "Perplexity-User"[0]:

> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.

[0]: https://docs.perplexity.ai/guides/bots

Lerc 5 days ago | parent | next [-]

There's definitely a distinction between fetching data for training and fetching data as an agent on behalf of a user. I guess you could demand that any program that identifies itself as a user agent should be blocked, but it seems counterproductive.

theamk 5 days ago | parent [-]

Counterproductive for what?

If I am writing for entertainment value, I see no problem with blocking all AI agents - the goal of text is to be read by humans after all.

For technical texts, one might want to block AI agents as well - they often omit critical parts and hallucinate. If you want your "DON'T DO THIS" sections to be read, better block them.
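If you do want to block them at the server rather than rely on robots.txt, a minimal sketch (matching on the agent names the vendors document and that are mentioned in this thread; User-Agent strings are trivially spoofable, so this only catches agents that identify themselves):

    # Agent names taken from the OpenAI and Perplexity bot documentation.
    BLOCKED_AGENTS = ("GPTBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User")

    def should_block(user_agent: str) -> bool:
        # Return 403 from your request handler when this is True.
        return any(name in user_agent for name in BLOCKED_AGENTS)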

nine_k 5 days ago | parent | prev [-]

But this is more like `curl https://some/url/...` ignoring robots.txt.

Crawlers are the thing that should honor robots.txt, "nofollow", etc.
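For what it's worth, "honoring robots.txt" for a crawler just means checking before each fetch, e.g. with Python's standard library (the crawler name here is made up):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved crawler checks every URL before fetching it;
    # `curl https://example.com/page` simply never does this.
    if rp.can_fetch("MyCrawlerBot", "https://example.com/page"):
        pass  # fetch the page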

whilenot-dev 5 days ago | parent [-]

The title of this site is "Perplexity Crawlers"...

And it clearly contradicts simonw's stance:

> it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it

simonw 5 days ago | parent [-]

Where did Perplexity say they would obey robots.txt?

They explicitly document that they do not obey robots.txt for that form of crawling (user-triggered fetching, where the data is not gathered for training).

Their documentation is very clear: https://docs.perplexity.ai/guides/bots

whilenot-dev 5 days ago | parent [-]

I thought your emphasis was on the "it's the idea that LLM vendors ignore your robots.txt file" part of your statement... Now your point is it's okay for them to ignore it because they announce that their crawler "ignores robots.txt rules"?

simonw 5 days ago | parent [-]

No, my argument here is that it's not OK to say "but OpenAI are obviously lying about following robots.txt for their training crawler" when their documentation says they obey robots.txt.

There's plenty to criticize AI companies for. I think it's better to stick to things that are true.

latexr 5 days ago | parent | next [-]

> when their documentation says

Their documentation could say Sam Altman is the queen of England. It wouldn’t make it true. OpenAI has been repeatedly caught lying about respecting robots.txt.

https://www.businessinsider.com/openai-anthropic-ai-ignore-r...

https://web.archive.org/web/20250802052421/https://mailman.n...

https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope...

simonw 5 days ago | parent [-]

Those three links don't support your argument here.

The Business Insider one is a paywalled rehash of this Reuters story https://www.reuters.com/technology/artificial-intelligence/m... - which was itself a report based on some data-driven PR by a startup, TollBit, who sell anti-scraping technology. Here's that report: https://tollbit.com/bots/24q4/

I downloaded a copy and found it actually says "OpenAI respects the signals provided by content owners via robots.txt allowing them to disallow any or all of its crawlers". I don't know where the idea that TollBit say OpenAI don't obey robots.txt comes from.

The second one is someone saying that their site, which didn't use robots.txt, was aggressively crawled.

The third one claims to prove OpenAI are ignoring robots.txt, but shows request logs for the user agent ChatGPT-User, which is NOT the same thing as GPTBot, as documented at https://platform.openai.com/docs/bots
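In robots.txt terms the two agents are addressed by separate rules, so a file like this only covers the training crawler and says nothing about user-triggered fetches:

    # Blocks the training crawler only; ChatGPT-User is unaffected.
    User-agent: GPTBot
    Disallow: /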

whilenot-dev 5 days ago | parent | prev [-]

Agree, and in this same vein TFA was talking about LLMs in general, not OpenAI specifically. While I get your concern and would also like to avoid any sensationalism, there is still this itch about the careful wording in all these company statements.

For example, who's the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat, with its urge for automation, it could very well be that Perplexity maintains a list of all the sites that block the PerplexityBot user agent through robots.txt rules. Such a list would help with crawl optimization, but it could also be used later to have an employee ask Perplexity a question that re-crawls the site with the Perplexity-User user agent anyway (the one that ignores robots.txt rules). Call it the work of the QA department.

Unless we worked for such a company in a high position we'd never really know, and the existing violations of trust - just in regard to copyrighted works alone(!) - are reason enough to keep a certain mistrust by default when it comes to young companies that are already valued in the billions and how they handle their most basic resources.