lllllm 4 days ago
Common Crawl already respects the CCBot opt-out on every crawl it does. we went a step further: back in the old days (our oldest training data is from 2013) LLMs did not exist, so website owners who opt out of AI crawlers today might also want the option to have their past content removed. arguments can be made either way, but we tried to stay on the cautious side at this point. we also wrote a paper on how this additional removal affects the downstream performance of the LLM: https://arxiv.org/abs/2504.06219 (it does so surprisingly little)
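For readers wondering what "also remove their past contents" could look like in practice, here is a minimal sketch, not the actual pipeline from the paper. It assumes you already have each host's robots.txt as it stands today (the `current_robots` dict is hypothetical) and checks only the `CCBot` user agent, whereas a real filter would likely cover more AI crawler tokens. The helper names `ccbot_allowed` and `filter_archive` are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch("CCBot", url)


def filter_archive(docs, current_robots):
    """Drop archived documents whose host opts out of CCBot *today*.

    docs: iterable of (url, text) pairs from past crawls.
    current_robots: hypothetical dict mapping host -> current robots.txt text.
    Hosts without an entry are treated as not having opted out (an assumption).
    """
    kept = []
    for url, text in docs:
        host = urlparse(url).netloc
        robots_txt = current_robots.get(host, "")
        if ccbot_allowed(robots_txt, url):
            kept.append((url, text))
    return kept


docs = [
    ("https://a.example/post", "still fine to use"),
    ("https://b.example/2013/page", "crawled long before the opt-out"),
]
current_robots = {"b.example": "User-agent: CCBot\nDisallow: /\n"}
filtered = filter_archive(docs, current_robots)
```

Here the 2013 page from `b.example` is dropped even though it was crawled before the opt-out existed, which is the retroactive part.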
pdpi 3 days ago
"I didn't know to withdraw consent" isn't the same as "I consent". Thank you for doing the right thing.
mycall 3 days ago
Ah good points, thanks for the clarification.