mycall 4 days ago:
> Any contents blocked by the current robots.txt is removed retroactively from the entire 2013-2024 range of the training dataset

Why not check historical versions of the robots.txt (e.g. on archive.org) and constrain the retroactive cutoff to a certain date range, parsing the robots.txt accordingly? That might increase the corpus size while staying within legal and fair use boundaries.
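For concreteness, here is a minimal sketch of what that lookup could look like, assuming the Wayback Machine's CDX API and the standard-library robotparser; the domain, date, and the "CCBot" user agent are illustrative choices, not anything from the dataset pipeline:

```python
# Minimal sketch: find the robots.txt snapshot closest to (but not after)
# a given crawl date via the Wayback CDX API, and ask whether a URL was
# allowed for a given crawler *at that time*, rather than today.
import urllib.robotparser
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def robots_at(domain: str, crawl_date: str):
    """Return a RobotFileParser for the snapshot nearest to crawl_date (YYYYMMDD), or None."""
    resp = requests.get(CDX_API, params={
        "url": f"{domain}/robots.txt",
        "output": "json",
        "filter": "statuscode:200",
        "to": crawl_date,            # only snapshots up to the crawl date
    }, timeout=30)
    rows = resp.json()
    if len(rows) < 2:                # first row is the CDX field header
        return None
    timestamp, original = rows[-1][1], rows[-1][2]   # latest snapshot <= crawl_date
    # The "id_" modifier returns the raw archived file without Wayback rewriting.
    raw = requests.get(f"http://web.archive.org/web/{timestamp}id_/{original}", timeout=30)
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(raw.text.splitlines())
    return rp

rp = robots_at("example.com", "20130601")
if rp is not None:
    print(rp.can_fetch("CCBot", "https://example.com/some/page"))
```

Whether that gain is worth the extra archive lookups is the open question the replies below address.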
lllllm 4 days ago (reply):
Common Crawl already respects the CCBot opt-out every time they do a crawl. We went a step further because back then (2013 is our oldest training data) LLMs did not exist, so website owners who opt out of AI crawlers today might like the option to also remove their past content. Arguments can be made either way, but we tried to remain on the cautious side at this point. We also wrote a paper on how this additional removal affects the downstream performance of the LLM: https://arxiv.org/abs/2504.06219 (it does so surprisingly little).
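A minimal sketch of the retroactive filter described above, under stated assumptions: documents carry a "url" field, and "CCBot" plus a "GPTBot"-style AI agent stand in for whatever opt-out list is actually honored. The point is that today's robots.txt is fetched once per host and applied to every document in the 2013-2024 range, regardless of when it was crawled:

```python
# Apply the *current* robots.txt of each host retroactively to old documents.
import urllib.robotparser
from urllib.parse import urlsplit

AI_AGENTS = ["CCBot", "GPTBot"]   # assumed opt-out user agents, for illustration
_parsers = {}                      # per-host cache of parsed robots.txt

def allowed_today(url: str) -> bool:
    host = urlsplit(url).netloc
    rp = _parsers.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
        rp.read()                  # fetch and parse today's robots.txt
        _parsers[host] = rp
    return all(rp.can_fetch(agent, url) for agent in AI_AGENTS)

def filter_corpus(docs):
    """Drop any document whose URL today's robots.txt disallows, whatever its crawl year."""
    return [d for d in docs if allowed_today(d["url"])]
```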
| ||||||||||||||
3np 4 days ago (reply):
I imagine archive.org's coverage of historical robots.txt files is sparse enough that it wouldn't be worth it.