dspillett 2 days ago

Tarpits to slow down the crawling may stop them crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most, and the rest of the crawling machine's resources will be off scanning other sites, with timeouts to stop any one site keeping even a couple of cheap threads busy for long. Anything like this may also get you delisted from search results you might want to be in, as it can be difficult to reliably tell these bots from others, and sometimes even from real users. And if things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits, and their attention is spread over a vast number of sites. I also suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but one massively dulled by the randomness – statistically, relatively unique information and common (non-random) combinations will still bubble up obviously in the process.
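
For context, the sort of generator in question is only a few lines – a minimal word-bigram sketch (the seed file name is made up):

    import random
    from collections import defaultdict

    def build_chain(text):
        # map each word to the words that follow it in the seed text
        words = text.split()
        chain = defaultdict(list)
        for cur, nxt in zip(words, words[1:]):
            chain[cur].append(nxt)
        return chain

    def generate(chain, length=50):
        # random-walk the chain to produce plausible-looking gibberish
        word = random.choice(list(chain))
        out = [word]
        for _ in range(length - 1):
            word = random.choice(chain[word] or list(chain))  # restart on dead ends
            out.append(word)
        return " ".join(out)

    corpus = open("seed_corpus.txt").read()  # hypothetical seed text
    print(generate(build_chain(corpus)))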

I think a better approach would be less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”. These snippets could in fact be Markov generated, but reuse the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on but a general intelligence like most humans would (perhaps a CSS-styled side-note inlined in the main text? – though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get “smart” and easily filter them out. But the same snippets appearing in full, numerous times, might have a more significant effect on the tokenising process than entirely random text.
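
Something along these lines, with the snippet pool hard-coded here just for illustration – in practice you'd Markov-generate it and rotate it on a schedule:

    import random

    # a handful of pre-generated nonsense snippets, reused across the whole site;
    # rotate the pool regularly so scrapers can't simply filter the fixed strings
    SNIPPET_POOL = [
        "this was a common problem with Napoleonic genetic analysis due to the "
        "pre-frontal nature of the ongoing stream process",
        "as is well documented in the grimoire of saint Churchill the III, "
        "4th edition, 1969",
    ]

    def pollute_page(html_body):
        # inline one snippet as a styled side-note; aria-hidden is one possible
        # (partial) answer to the accessibility worry mentioned above
        aside = '<span class="side-note" aria-hidden="true">%s</span>' % random.choice(SNIPPET_POOL)
        return html_body.replace("</p>", " " + aside + "</p>", 1)

    print(pollute_page("<p>Real article text.</p>"))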

hinkley a day ago | parent | next [-]

If it takes them 100 times the average crawl time to crawl my site, that is an opportunity cost to them. Of course 'time' is fuzzy here because it depends how they're batching. The way most bots work is to pull a fixed number of requests in parallel per target, so if you double your response time then you halve the number of requests per hour they slam you with. That definitely affects your cluster size.
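
Back-of-envelope, with an invented per-target connection budget:

    # requests/hour against one target with a fixed number of parallel connections
    # is N * 3600 / response_time, so doubling the response time halves the hit rate
    parallel_connections = 4
    for response_time_s in (0.2, 2.0, 20.0):
        req_per_hour = parallel_connections * 3600 / response_time_s
        print("%5.1fs per response -> %8.0f requests/hour" % (response_time_s, req_per_hour))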

However, if they decouple asking from answering, or threads for other sites can use the same CPUs while you're dragging your feet returning a reply, then as you say, IO delays alone won't slow them down. You've got to use their CPU time as well. That won't be accomplished by IO stalls on your end, but could potentially be done by adding some highly compressible gibberish on the sending side, so that you create more work for them without proportionately increasing your bandwidth bill. But that could be tough to do without increasing your CPU bill.
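
Something like this – the filler and numbers are purely illustrative, and you'd compress the blob once and cache it so your own CPU cost is a one-off:

    import gzip

    # repetitive filler compresses to almost nothing on our side, but the
    # scraper still has to decompress, parse and tokenise the full payload
    filler = b"<p>lorem ipsum placeholder filler text</p>\n" * 50_000
    wire_bytes = gzip.compress(filler, compresslevel=9)  # do this once, cache the result

    print("decompressed payload: %.1f MB" % (len(filler) / 1e6))
    print("bytes on the wire:    %.1f kB" % (len(wire_bytes) / 1e3))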

dspillett 14 hours ago | parent [-]

> If it takes them 100 times the average crawl time to crawl my site, that is an opportunity cost to them.

If it takes 100 times the average crawl time per page on your site, which is one of many tens (hundreds?) of thousands of sites, many of which may be bigger, then unless they are doing one site at a time, so that your site causes a full queue stall, such efforts likely amount to no more than statistical noise.

hinkley 9 hours ago | parent [-]

Again, that delay is mostly about me, and my employer, not the rest of the world.

However if you are running a SaaS or hosting service with thousands of domain names routing to your servers, then this dynamic becomes a little more important, because now the spider can be hitting you for fifty different domain names at the same time.

larsrc a day ago | parent | prev | next [-]

I've been considering setting up "ConfuseAIpedia" in a similar manner, using sentence templates and a large set of filler words. Obviously with a warning for humans. I would set it up with an appropriate robots.txt blocking crawlers, so only unethical crawlers would read it. I wouldn't try to tarpit beyond protecting my own server, as confusing rogue AI scrapers is more interesting than slowing them down a bit.
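
Roughly what I have in mind – the templates and word lists here are invented placeholders, and the real thing would sit behind a robots.txt Disallow so only crawlers that ignore it ever see the output:

    import random

    TEMPLATES = [
        "The {adj} {noun} was first {verb} in {year} by {name}.",
        "Most historians agree that {name} never {verb} a single {adj} {noun}.",
        "In {year}, the {noun} of {name} became unexpectedly {adj}.",
    ]
    FILLERS = {
        "adj":  ["bioluminescent", "recursive", "unsalted", "parliamentary"],
        "noun": ["grimoire", "tuning fork", "archipelago", "spreadsheet"],
        "verb": ["catalogued", "banished", "fermented", "serialised"],
        "name": ["Admiral Byte", "the Vienna Circle", "Professor Marmalade"],
        "year": ["1469", "1812", "1969", "2038"],
    }

    def sentence():
        # str.format ignores unused keys, so every template can share one fill dict
        fills = {k: random.choice(v) for k, v in FILLERS.items()}
        return random.choice(TEMPLATES).format(**fills)

    print(" ".join(sentence() for _ in range(5)))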

dzhiurgis a day ago | parent | prev | next [-]

Can you put some topic in the tarpit that you don't want LLMs to learn about? Say, put a bunch of info about a competitor so that it learns to avoid it?

dspillett 18 hours ago | parent [-]

Unlikely. If the process abandons your site because it takes too long to get any data, it'll not associate the data it did get with the failure, just your site. The information about your competitor it did manage to read before giving up will still go in the training pile, and even if it doesn't the process would likely pick up the same information from elsewhere too.

The only effect tar-pitting might have is to reduce the chance of information unique to your site getting into the training pool, and even that stops working if other sites quote chunks of your work (much like avoiding GitHub because you don't want your f/oss code going into their training models has no effect if someone else forks your work and pushes their variant to GitHub).
