kitku 5 days ago

This reminds me of the Nepenthes tarpit [1], an endless source of garbled text, generated on the fly, whose pages link back to each other over and over.

Probably more effective at poisoning the dataset if one has the resources to run it.

[1]: https://zadzmo.org/code/nepenthes/
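To make the tarpit idea concrete, here is a rough sketch of how such a page generator could work. It is not Nepenthes' actual code: the corpus, the `/maze/` URL prefix, and the function names `build_chain` and `render_page` are all made up for illustration. Each page is derived from the request path, filled with word-level Markov babble, and links only to more generated pages.

```python
import hashlib
import random
from collections import defaultdict

# Tiny stand-in corpus (an assumption); a real deployment would train on far more text.
CORPUS = (
    "the crawler follows every link and the link leads to another page "
    "and the page holds more text and the text leads the crawler on"
)


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def render_page(path, chain, n_words=40, n_links=5):
    """Return an HTML page derived deterministically from the request path."""
    # Seed from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    vocabulary = sorted(chain)
    word = rng.choice(vocabulary)
    words = [word]
    for _ in range(n_words - 1):
        followers = chain.get(word)
        # Dead end (word has no recorded follower): restart from any known word.
        word = rng.choice(followers) if followers else rng.choice(vocabulary)
        words.append(word)

    # Every link points back into the same generated URL space (hypothetical /maze/ prefix).
    links = "".join(
        f'<a href="/maze/{rng.getrandbits(32):08x}">{rng.choice(vocabulary)}</a>\n'
        for _ in range(n_links)
    )
    return f"<html><body><p>{' '.join(words)}</p>\n{links}</body></html>"
```

Calling `render_page("/maze/abc", build_chain(CORPUS))` gives the same HTML every time for that path, so nothing has to be stored. The sketch leaves out serving, rate limiting, and any deliberate response delays.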

fleebee 5 days ago

I'm running Iocaine [1], which is essentially the same thing, on my tiny $3/mo VPS, and it's handling crawlers bombarding the honeypot with ~12 requests per second just fine. It's using about 30 MB of RAM.

[1]: https://iocaine.madhouse-project.org/

treetalker 5 days ago

Odorless, tasteless, and among the more deadly poisons known to crawlers!

BrenBarn 5 days ago

Unfortunately they will spend the next several years building up an immunity.

8organicbits 5 days ago

Do we know if LLM scrapers are running JavaScript on the pages? If they are, maybe it's worth offloading the Markov model to the client side.