Remix.run Logo
efilife 16 hours ago

> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare

I'm sorry, what? I can't believe I am reading this on HackerNews. All you have to do is code your own, BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like reject short useragents). I do this and NEVER had a bot get past this and not a single user ever complained. It's extremely simple, I should ship this and charge people if no one seems to be able to figure this out by themselves.

n1xis10t 16 hours ago | parent | next [-]

Oops you just leaked your own intellectual property

ATechGuy 16 hours ago | parent | prev [-]

From ChatGPT:

This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.

efilife 15 hours ago | parent | next [-]

From someone who actually does this stuff:

The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. You are not that important to waste compute power to set up a whole ass headless browser to scrape your page. Why am I even arguing with ChatGPT?

phyzome 15 hours ago | parent | prev [-]

There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."

You're not adding anything to the conversation.

cyphar 10 hours ago | parent [-]

Yeah, I really have to wonder what the thought process is behind leaving such a comment. When people first started doing it I wondered if it was some kind of guerrilla outrage marketing campaign.

PunchyHamster 6 hours ago | parent | next [-]

There was no thought process

efilife 9 hours ago | parent | prev [-]

Maybe he wanted to verify whether what I was saying was true and asked ChatGPT, then tried to be helpful by pasting the response here?

cyphar 3 hours ago | parent | next [-]

Maybe I'm getting too jaded but I'm struggling to be quite that charitable.

The entireity of the human-written text in that comment was "From ChatGPT:" and it was formatted as though it was a slam-dunk "you're wrong, the computer says so" (imagine it was "From Wikipedia" followed by a quote disagreeing with you instead).

I'm sure some people do what you describe but then I would expect at least a little bit more explanation as to why they felt the need to paste a paragraph of LLM output into their comment. (While I would still disagree that it is in any way valuable, I would at least understand a bit about what they are trying to communicate.)

phyzome 3 hours ago | parent | prev [-]

Yeah, I agree that that's likely the thought process. It just happens to be the opposite of helpful.