rozab 5 days ago

After I set up a self-hosted git forge a little while ago, it got hammered within minutes by OpenAI, Anthropic, etc. They were extremely aggressive, grabbing every individual file from every individual commit, one at a time.

I hadn't backlinked the site anywhere and was just testing, so I hadn't thought to put up a robots.txt. They must have found me through my cert registration.

After I put up my robots.txt (with an explicit block for each user agent instead of a wildcard, since I'd heard some crawlers ignore wildcards), the scraping stopped completely within a day or so. The only traffic I get now is from vulnerability scanners, or random spiders taking just the homepage.
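
For anyone wanting to sanity-check a similar setup, here is a rough sketch of what explicit per-UA blocks look like and how to test them with Python's standard-library parser. The robots.txt below is illustrative, not my actual file. GPTBot and ClaudeBot are the tokens I understand OpenAI and Anthropic to publish, but check each vendor's current docs. The `allowed` helper and the example URL are made up for the demo.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: one explicit block per crawler, no wildcard entry.
# The user-agent tokens are assumptions; verify them against vendor docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

# Hypothetical URL standing in for a single commit page on the forge.
EXAMPLE_URL = "https://git.example.org/repo/commit/abc123"


def allowed(user_agent: str, url: str = EXAMPLE_URL) -> bool:
    """Return True if the rules in ROBOTS_TXT permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

Note that with no wildcard block, any crawler not named explicitly is still allowed. That also only tells you what a well-behaved parser would conclude, not what a given crawler actually does.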

I know my site is of no consequence, but I would really like to see some evidence from those claiming OpenAI et al. ignore robots.txt. They are evil and disrespectful and I'm gutted they stole my code for profit, but I'm still sceptical of these claims.

Cloudflare have done lots of work here and have never mentioned crawlers ignoring robots.txt:

https://blog.cloudflare.com/control-content-use-for-ai-train...