simonw 12 hours ago

One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt.

dijit 12 hours ago | parent | next [-]

Uncharitable.

Robots.txt is already the understood mechanism for getting robots to avoid scraping a website.

simonw 11 hours ago | parent [-]

People often list specific user agents in there, which is hard to do if you don't know what the user agents will be in advance!
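To illustrate the point, here's a quick sketch using Python's standard-library robots.txt parser. The rules below are invented for the example, but the behavior is standard: a bot whose user agent isn't listed (and with no wildcard record) is simply allowed through. Applebot-Extended is Apple's real opt-out agent, announced well after crawling began.

```python
from urllib import robotparser

# A robots.txt written against the crawlers its author knew about at the time.
# (Example rules only.)
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The crawler named in the file is blocked...
print(rp.can_fetch("GPTBot", "https://example.com/page"))            # False
# ...but a crawler announced later isn't matched by any record, so it's allowed.
print(rp.can_fetch("Applebot-Extended", "https://example.com/page"))  # True
```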

lxgr 8 hours ago | parent | next [-]

That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.

Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough.
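One way that could look (purely hypothetical syntax — no such directive exists in robots.txt today, and these category names are made up for illustration):

```
# Hypothetical category directive: express intent by crawler purpose,
# rather than chasing individual user-agent strings.
Crawler-category: ai-training
Disallow: /

Crawler-category: search-indexing
Allow: /

Crawler-category: ai-retrieval
Allow: /
```

Any new crawler would then self-classify into a category, and existing robots.txt files would cover it automatically.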

wat10000 11 hours ago | parent | prev [-]

If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
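Checked against Python's stdlib parser (rules invented for the example): a wildcard record catches every bot, including ones that don't exist yet, while a specific record carves out an exception for a known crawler.

```python
from urllib import robotparser

# Deny by default with a wildcard; explicitly allow one known crawler.
# (Example rules only.)
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/"))    # True  (explicitly allowed)
print(rp.can_fetch("BrandNewBot", "https://example.com/"))  # False (wildcard blocks it)
```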

lxgr 8 hours ago | parent | next [-]

Yes, but given the lack of generic "robot types" (e.g. "allow algorithmic search crawlers, allow archival, deny LLM training crawlers"), neither opt-in nor opt-out seems like a particularly great option in an age where new crawlers are appearing rapidly (and often, such as here, are announced only after the fact).

simonw 10 hours ago | parent | prev [-]

Sure, but I still think it's OK to look at Apple with a raised eyebrow when they say "and our previously secret training data crawler obeys robots.txt so you can always opt out!"

conradev 6 hours ago | parent | prev [-]

They documented it in 2015: https://www.macrumors.com/2015/05/06/applebot-web-crawler-si...