kennywinker 2 hours ago

Do any agents respect agents.txt?

Is there a way to opt my websites out of ai data collection?

wolttam 2 hours ago | parent | next [-]

Any measure you put in place can/will be ignored by the actors who never planned to respect your wishes in the first place.

That's just how the web works, though.

cortesoft 2 hours ago | parent [-]

This is true for measures that require the actor to respect your wishes, but doesn't apply to measures that force them to.

wolttam an hour ago | parent [-]

This is just like security; the most secure system is the one that nobody can use.

I think the proof-of-work approach that anubis[0] takes is pretty interesting.

I love the idea of having to do a small amount of work for the author of the content in order to get access to their content. It would be interesting to see a scheme where the proof-of-work that clients do in systems like anubis actually had a way to directly benefit the author.

[0]: https://github.com/TecharoHQ/anubis
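The core of that idea can be sketched hashcash-style in a few lines. This is not Anubis's actual protocol; the SHA-256 leading-zero-bits rule and the difficulty value here are illustrative assumptions:

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce so that sha256(challenge + nonce)
    has `difficulty` leading zero bits. Cost grows as 2**difficulty."""
    shift = 256 - difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> shift == 0:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a solution costs one hash, not 2**difficulty."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty) == 0
```

The server hands out a fresh `challenge` per visitor and only serves content after `verify` passes: cheap for one human reader, expensive for a crawler fetching millions of pages.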

ghostlyy 2 hours ago | parent | prev | next [-]

Partial answer: the major labs (Anthropic, OpenAI) do respect robots.txt for their named crawlers, so blocking ClaudeBot/GPTBot in robots.txt works for those specific bots. What you can't easily opt out of is the indirect ingestion via Common Crawl, scraped datasets, and unnamed crawlers. agents.txt doesn't change that picture. The Allow-Training vs Allow-RAG split in the default is the useful part of the file. They're different operations with different costs to the site owner. Training is a one-time bulk ingest. RAG is a runtime fetch per query. A site owner might reasonably allow one and not the other.
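For the named crawlers, the opt-out is a plain robots.txt. The user-agent tokens below are the ones the vendors publish (GPTBot for OpenAI, ClaudeBot for Anthropic, CCBot for Common Crawl); check each vendor's docs for the current list:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Blocking CCBot is the closest you get to addressing the Common Crawl path, but it only stops future crawls, not data already in published snapshots.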

ninjin 2 hours ago | parent [-]

I can report that Facebook does not respect robots.txt. Heck, I even mailed domain@fb.com with the specific IP ranges and log samples three times over a month and they did not even respond. They keep wasting my CPU cycles to this day by crawling massive development forks (I hope they choke on the data...):

    $ (cat /var/www/logs/access.log; zcat /var/www/logs/access.log*.gz) | grep 2a03:2880: | wc -l
    626396
About three hits per second for months now.

dylan604 an hour ago | parent | next [-]

Can you serve them a specific file that would make it expensive on their end?

ninjin an hour ago | parent [-]

If I had the time and energy, I would make some sort of simple code language model, generate infinite junk, and feed that to them in the hope that it ruins their future training runs. But I lack the former and some of the latter. Alternatively, maybe I would actually read one of those "backdoor papers" and try to inject something like that.
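Even without a language model, the junk-feeding idea can be sketched as a deterministic pseudo-code babbler; everything here (the keyword lists, the line shape) is made up for illustration, and an actual poisoning attempt is a much murkier project:

```python
import random

# Hypothetical vocabulary for plausible-looking but meaningless "code".
KEYWORDS = ["def", "return", "if", "while", "import", "class", "for", "lambda"]
NAMES = ["foo", "bar", "baz", "qux", "frob", "zork"]

def junk_lines(seed: int, n: int):
    """Yield n deterministic lines of junk; same seed, same output,
    so a crawler re-fetching a URL sees a stable (worthless) page."""
    rng = random.Random(seed)
    for _ in range(n):
        yield (f"{rng.choice(KEYWORDS)} {rng.choice(NAMES)}"
               f"({rng.choice(NAMES)}): {rng.choice(NAMES)} = {rng.randint(0, 999)}")
```

Seeding from the request path makes the junk space effectively infinite while each individual URL stays consistent.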

dylan604 an hour ago | parent [-]

I was wondering if this could be done without being malicious to that level. If they are costing you money, then I have no moral qualms about responding in kind. Taking that next step would give up the moral high ground and potentially put you on legally questionable ground.

I get the lack of time/energy for this type of thing. It is one of those projects that could be personally satisfying but is very hard to justify if you're a family person, though a younger person might get a lot of pleasure from it.

drcongo an hour ago | parent | prev [-]

I block their entire ASN when they do that.
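In practice an ASN block lives in the firewall, but the matching logic can be sketched with Python's stdlib `ipaddress` module. The single /32 below covers the 2a03:2880: prefix from the logs above; a real block would pull the full announced prefix list for the AS from a BGP/WHOIS source:

```python
import ipaddress

# Illustrative: one aggregate covering the prefix seen in the logs.
# A real ASN block would load every prefix the AS announces.
BLOCKED = [ipaddress.ip_network("2a03:2880::/32")]

def is_blocked(addr: str) -> bool:
    """True if the address falls inside any blocked prefix."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)
```

The same membership test works for filtering log lines or returning an early 403 in middleware before any expensive page rendering runs.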

sschueller 2 hours ago | parent | prev | next [-]

Well, Claude still thinks it shouldn't read AGENTS.md [1], so they probably don't care about agents.txt on a web server either...

[1] https://github.com/anthropics/claude-code/issues/6235

embedding-shape 2 hours ago | parent | prev [-]

Add HTTP Basic Auth in front of your website, then share the credentials with people who are allowed to view your website. Make sure you don't hand out credentials to employees of OpenAI, Anthropic, xAI or Microsoft.
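The check behind Basic Auth is tiny; a sketch of the server side, with a placeholder user table (in production this sits in the web server or reverse proxy, and passwords are stored hashed, not in plaintext):

```python
import base64
import hmac
from typing import Optional

# Placeholder credentials for illustration only.
USERS = {"reader": "s3cret"}

def authorized(header: Optional[str]) -> bool:
    """Validate an HTTP 'Authorization: Basic <base64(user:pass)>' header."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        user, _, password = base64.b64decode(header[6:]).decode().partition(":")
    except Exception:
        return False  # malformed base64 or non-UTF-8 payload
    expected = USERS.get(user)
    # Constant-time comparison to avoid leaking prefix matches via timing.
    return expected is not None and hmac.compare_digest(
        expected.encode(), password.encode())
```

A request failing this check gets a 401 with `WWW-Authenticate: Basic realm="..."`, which is what makes browsers pop the login prompt.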