bogwog 6 days ago

I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/

It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big-name company, and force them to reassess their crawling strategy going forward?

ronsor 6 days ago

That won't work, because garbage data is filtered after the full dataset is collected anyway. Every LLM trainer these days knows that curation is key.

bogwog 6 days ago

If the "garbage data" is AI generated, it'll be hard or impossible to filter.

creatonez 6 days ago

Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They were dealing with this problem long before LLM-related crawling.