madeofpalk 9 hours ago

Is there any evidence or hints that these actually work?

It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

raincole 8 hours ago | parent | next [-]

It might work against people who just use their Mac Mini with OpenClaw to summarize the news every morning, but it certainly won't work against Google.

More centralized web ftw.

hexage1814 7 hours ago | parent | next [-]

It also probably won't work if the person actually wants your content and checks whether the thing they scraped makes sense or is just noise. None of these are new tricks: site owners have been sending junk/fake data to web scrapers since web scraping was invented.

otherme123 7 hours ago | parent | prev | next [-]

In my experience, Google (among others) plays nice. Just put "Disallow: /" (under "User-agent: *") in your robots.txt, and they won't bother you again.

My current problem is OpenAI, which scrapes massively, ignoring every limit (426, 444, and whatever else you throw at them), and botnets from East Asia that use one IP per scrape but thousands of IPs in total.
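For crawlers that at least send an honest User-Agent, the 444 approach can be sketched as an nginx rule. This is an illustrative fragment, not the commenter's actual config: the bot names and server context are assumptions, and it does nothing against the IP-rotating botnets mentioned above.

```nginx
# Sketch only: match self-identified AI crawlers by User-Agent.
# GPTBot is OpenAI's declared crawler UA; ClaudeBot is Anthropic's.
map $http_user_agent $ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
}

server {
    listen 80;
    server_name example.com;

    if ($ai_bot) {
        return 444;  # nginx-specific: close the connection with no response
    }
}
```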

LaGrange 7 hours ago | parent | prev [-]

> It might work against people who just use their Mac Mini with OpenClaw to summarize the news every morning,

Good enough for me.

> More centralized web ftw.

This ain't got anything to do with a "centralized web"; this kind of epistemological vandalism can't be shunned enough.

sd9 9 hours ago | parent | prev | next [-]

Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make a material difference. I'm tired.

20k 9 hours ago | parent [-]

I definitely get this. The thing that gives me hope is that you only need to poison a very small percentage of content to damage AI models pretty significantly. It helps combat mass scraping, because a significant chunk of the data they collect will be useless, and it's very difficult to filter it by hand.

lucasfin000 7 hours ago | parent [-]

The asymmetry is what makes this very interesting. The cost to inject poison is basically zero for the site owner, but the cost to detect and filter it at scale is significant for the scraper. That math gets a lot worse for them as more sites adopt it. It doesn't solve the problem, but it changes the economics.

xyzal 6 hours ago | parent | prev | next [-]

About two years ago, I made up a reference to a nonexistent Python library and put code "using" it in just 5 GitHub repos. Several months later, the free ChatGPT picked it up. So IMO it works.

logicprog 6 hours ago | parent [-]

Via websearch? Or training?

bediger4000 7 hours ago | parent | prev | next [-]

The search engine crawlers are sophisticated enough, but Meta's are not. Neither is Anthropic's Claude crawler. Source: personal experience trying garbage generators on Yandex, BLEXBot, Meta's, and Anthropic's crawlers.

I'm completely uncertain whether the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?

spiderfarmer 8 hours ago | parent | prev | next [-]

There are hundreds of bots using residential proxies. That is not free. Make them pay.

m00dy 8 hours ago | parent | prev | next [-]

It won't work, especially on Gemini. Googlebot is very experienced when it comes to crawling. It might work on OpenAI and the others, maybe.

nubg 9 hours ago | parent | prev | next [-]

What kind of mitigations? How would you detect the poison fountain?

avereveard 9 hours ago | parent | next [-]

style="display: none;" aria-hidden="true" tabindex="1"

Many scrapers already know not to follow these, since it's how sites used to "cheat" PageRank by serving keyword soups.
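As a sketch of why this particular trick is weak: a scraper can strip hidden subtrees with a few lines of stdlib Python. This is an illustrative filter, not any real crawler's code; it only handles inline display:none and aria-hidden, and skips depth tracking for void elements for simplicity.

```python
from html.parser import HTMLParser

# HTML elements with no closing tag -- excluded from depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs, skipping any link inside a subtree hidden via
    inline display:none or aria-hidden="true" -- the same cues
    scrapers learned to distrust in the PageRank-cloaking era."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # >0 while inside a hidden subtree
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        hidden = ("display:none" in a.get("style", "").replace(" ", "")
                  or a.get("aria-hidden") == "true")
        if tag in VOID_TAGS:        # no closing tag: don't track depth
            return
        if hidden or self.hidden_depth:
            self.hidden_depth += 1  # entering (or nested in) a hidden subtree
        elif tag == "a" and "href" in a:
            self.links.append(a["href"])

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1
```

Feeding it markup with both visible and hidden links keeps only the visible hrefs.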

m00dy 8 hours ago | parent | next [-]

Google will give your website a penalty for doing this.

phplovesong 5 hours ago | parent | prev [-]

You don't have to use this. You can keep the content visible to scrapers but hide it from humans with other easy tricks.

cuu508 4 hours ago | parent [-]

Scrapers can work around those other easy tricks too.

GaggiX 9 hours ago | parent | prev [-]

Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature for a function, for example, that becomes apparent when the code is executed.
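A toy illustration of that point, not any lab's actual pipeline: an execution-based reward immediately zeroes out code that leans on a hallucinated, poisoned API.

```python
def verifiable_reward(code: str) -> float:
    """Toy RLVR-style check: reward 1.0 if the snippet executes
    cleanly, 0.0 otherwise. Real verifiers also run unit tests,
    but even this catches imports of made-up libraries."""
    try:
        exec(code, {})  # run in an empty namespace
        return 1.0
    except Exception:
        return 0.0
```

Valid code scores 1.0; a snippet importing a nonexistent poison library scores 0.0.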

phoronixrly 9 hours ago | parent | prev [-]

It does work, on two levels:

1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.

2. Once you see a browser visit a bullshit link, you insta-ban it: you now know it is a bot, because only a client that ingested the poisoned data could have found that link.
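The two levels above can be sketched in a few lines. The trap paths and the in-memory ban list are hypothetical; a real deployment would do this in server middleware with persistent storage.

```python
# Hypothetical bait URLs: linked only from the poisoned pages,
# so no human visitor should ever request them.
TRAP_PATHS = {"/.bait/feed-1", "/.bait/feed-2"}
banned_ips = set()  # stand-in for a persistent ban list

def allow_request(ip: str, path: str) -> bool:
    """Decide whether to serve the request. Visiting a trap path
    bans the client: only a bot that scraped the poison (level 1)
    could have found the link, so it is insta-banned (level 2)."""
    if ip in banned_ips:
        return False
    if path in TRAP_PATHS:
        banned_ips.add(ip)  # level 2: ban the poisoned browser bot
        return False
    return True
```

Ordinary requests pass until a client touches a bait URL, after which every request from that IP is refused.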

My personal preference is to use iocaine for this purpose, though, in order to protect the entire server rather than a single site.