kriro 4 days ago

Really happy to see this and will give it a good spin. They seem to be doing things the right way, in my opinion:

""" To implement this filter, we begin by ranking URL domains according to the volume of texts they contribute to the FineWeb (Penedo et al., 2024a) and FineWeb-2 (Penedo et al., 2025) corpus, as an approximation of web-level English and multilingual data. From this ranking, we select the top one million English domains and the top one million non-English domains. Due to domain overlap and the fact that some sites are now offline, the total number of accessible robots.txt files is smaller than two million. For each domain that remains reachable, we retrieve its robots.txt file as of January 2025 and examine the directives relevant to AI training. In particular, we focus on those targeting the AI-specific user agents listed in Appendix A. Any contents blocked by the current robots.txt is removed retroactively from the entire 2013-2024 range of the training dataset. We follow an opt-out policy, that is, if the corresponding robots.txt files are not available, we consider the data usable for training. The filtering process results in an estimated token loss of approximately 8% in English data and 4% in multilingual data. """

mycall 4 days ago | parent

> Any content blocked by the current robots.txt is removed retroactively from the entire 2013-2024 range of the training dataset

Why not check historical versions of the robots.txt (e.g. on archive.org) and constrain the retroactive cutoff to a certain date range, parsing each period's robots.txt accordingly? That might increase the corpus size while staying within legal and fair use boundaries. Something like the Wayback Machine availability API could drive that lookup; see the sketch below.
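
The endpoint below is real; the helper name and the per-year application are just the idea sketched out, nothing more:

    # Rough sketch of the per-period idea: fetch the robots.txt that
    # was live around a given year from the Wayback Machine, then apply
    # that era's directives only to crawls from that era. The helper
    # name `historical_robots` is hypothetical.
    import json
    import urllib.robotparser
    from urllib.request import urlopen

    def historical_robots(domain: str, year: int):
        """Return a parser for the archived robots.txt closest to
        Jan 1 of `year`, or None if no snapshot covers that period."""
        api = ("https://archive.org/wayback/available?"
               f"url={domain}/robots.txt&timestamp={year}0101")
        with urlopen(api) as resp:
            closest = json.load(resp).get("archived_snapshots", {}).get("closest")
        if not closest or not closest.get("available"):
            return None  # no archived copy for this period
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(closest["url"])
        rp.read()  # parse the snapshot as served by web.archive.org
        return rp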

lllllm 4 days ago | parent | next

Common Crawl already respects the CCBot opt-out every time they do a crawl.

We went a step further: back when our oldest training data was crawled (2013), LLMs did not exist, so website owners opting out of AI crawlers today might want the option to also remove their past content.

Arguments can be made either way, but we tried to remain on the cautious side at this point.

We also wrote a paper on how this additional removal affects downstream performance of the LLM: https://arxiv.org/abs/2504.06219 (the impact is surprisingly small).

pdpi 3 days ago | parent | next

"I didn't know to withdraw consent" isn't the same as "I consent". Thank you for doing the right thing.

mycall 3 days ago | parent | prev

Ah good points, thanks for the clarification.

3np 4 days ago | parent | prev

I imagine coverage is sparse enough to not be worth it.