dijit 12 hours ago

Uncharitable.

Robots.txt is already the understood mechanism for getting robots to avoid scraping a website.

simonw 11 hours ago | parent [-]

People often target specific user agents in there, which is hard to do if you don't know what those user agents are in advance!

lxgr 8 hours ago | parent | next [-]

That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.

Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough.
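A sketch of how that might look, assuming a made-up "Crawler-category" directive that no current robots.txt implementation actually supports:

    # Hypothetical syntax, not part of any current robots.txt standard
    Crawler-category: search-index
    Allow: /

    Crawler-category: llm-training
    Disallow: /

    Crawler-category: llm-retrieval
    Allow: /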

wat10000 11 hours ago | parent | prev [-]

If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
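For instance (with "SomeSpecificBot" standing in for whatever crawler you actually mean), the first rule binds only that one bot, while the wildcard covers any compliant crawler, including ones that don't exist yet:

    # Only binds the named crawler; a new bot is unaffected by this rule
    User-agent: SomeSpecificBot
    Disallow: /private/

    # Applies to every crawler that honors robots.txt, known or not
    User-agent: *
    Disallow: /private/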

lxgr 8 hours ago | parent | next [-]

Yes, but given the lack of generic "robot types" (e.g. "allow algorithmic search crawlers, allow archival, deny LLM training crawlers"), neither opt-in nor opt-out seems like a particularly great option in an age where new crawlers appear rapidly and, as in this case, are often announced only after the fact.
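The only workaround today is to enumerate each training crawler by its published token as it becomes known. Something like the following, using tokens that have been publicly announced (GPTBot, Google-Extended, Applebot-Extended):

    # Per-crawler opt-outs, each only possible once the operator announces its token
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /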

simonw 10 hours ago | parent | prev [-]

Sure, but I still think it's OK to look at Apple with a raised eyebrow when they say "and our previously secret training data crawler obeys robots.txt so you can always opt out!"