angelhadjiev · 2 days ago
Sites are deploying infinite fake-page mazes (Nepenthes, Iocaine, etc.) to trap and poison AI training crawlers that ignore robots.txt. The motivation is understandable: Cloudflare reported that 75% of AI web traffic in mid-2025 was training-related, and nearly 60% of reputable sites now block AI bots.

The problem is that tarpits don't check intent; they detect automated request patterns. If your price tracker follows links systematically, skips JS execution, or hits pages at regular intervals, it looks identical to GPTBot and the trap fires anyway. (A rough guard against wandering into a maze is sketched below.)

The collateral damage is real. One Rutgers/Wharton study found that sites with aggressive crawler blocking saw a 23% drop in total traffic, including human visitors.

The escalation ladder is now at step 4:

1. robots.txt (gentleman's agreement)
2. User-agent filtering
3. Behavioural detection
4. Active tarpits: waste your compute, poison your data

If you're running any data pipeline at scale, you need to validate responses now. Tarpits serve plausible-looking Markov garbage; if you're not checking, it's already in your database. A few cheap heuristics are sketched below as well.

Full writeup: https://foura.ai/blog/web-scraping-tarpits-collateral-damage
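If your crawler can end up inside one of these mazes, the cheapest defence is a hard budget. A minimal sketch in Python, assuming a requests-based fetcher; the thresholds, the fetch() helper, and the User-Agent string are illustrative assumptions, not anything from the linked writeup:

    # Minimal crawl-budget guard -- illustrative only; tune per pipeline.
    import random
    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    import requests

    MAX_PAGES_PER_HOST = 500   # hypothetical per-host budget
    MAX_PATH_DEPTH = 8         # generated mazes tend to nest links ever deeper

    pages_fetched = defaultdict(int)

    def fetch(url: str, timeout: float = 10.0) -> str | None:
        """Fetch a page while refusing to wander into an unbounded link maze."""
        parsed = urlparse(url)
        host = parsed.netloc
        depth = len([p for p in parsed.path.split("/") if p])

        if pages_fetched[host] >= MAX_PAGES_PER_HOST:
            return None  # budget exhausted: likely a tarpit or a crawl bug
        if depth > MAX_PATH_DEPTH:
            return None  # suspiciously deep path, typical of generated mazes

        # Jittered delay so requests don't arrive at machine-regular intervals.
        time.sleep(random.uniform(1.0, 3.0))

        resp = requests.get(
            url,
            timeout=timeout,
            headers={"User-Agent": "price-tracker/1.0 (contact@example.com)"},
        )
        pages_fetched[host] += 1
        return resp.text

The jitter also makes the traffic look less machine-regular, though as noted above that's no guarantee the trap won't fire.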
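On the validation side, here's a rough sketch of the kind of cheap checks you can run before a page hits the database. looks_like_tarpit_garbage() and every threshold are made up for illustration, a starting point rather than the article's method:

    # Cheap sanity checks for scraped HTML before it enters the database.
    import re
    import zlib

    def looks_like_tarpit_garbage(html: str) -> bool:
        text = re.sub(r"<[^>]+>", " ", html)   # crude tag strip
        words = text.split()
        if len(words) < 50:
            return False                        # too short to judge

        # Heuristics: repetitive filler compresses far better than real prose,
        # tends to use a narrow vocabulary, and maze pages are stuffed with links.
        compressed_ratio = len(zlib.compress(text.encode())) / len(text.encode())
        unique_ratio = len(set(w.lower() for w in words)) / len(words)
        link_count = len(re.findall(r"<a\s", html, flags=re.I))
        links_per_word = link_count / len(words)

        return (compressed_ratio < 0.25       # suspiciously repetitive
                or unique_ratio < 0.2         # tiny vocabulary
                or links_per_word > 0.2)      # page is mostly links

For a price tracker specifically, a schema check (does the page actually contain a price where you expect one?) is an even stronger signal than these generic heuristics.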