mrsilencedogood 3 days ago

Fortunately, it is now easier than ever to do small-scale scraping, the kind yt-dlp does.

I can literally just go write a script that uses headless Firefox + mitmproxy in about an hour or two of fiddling, and as long as I don't then run it from 100 VPSes and scrape their entire website in one huge blast, I can typically archive whatever content I actually care about, basically no matter what protection mechanisms they have in place. Cloudflare won't detect a headless Firefox at low rates (and by "low" I mean basically anything you could do off your laptop from your home IP). Modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS. And obviously at low scale you can just solve captchas yourself.
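For anyone curious what "low rate, single worker" looks like in practice, here is a minimal Python sketch. It is only an illustration under assumptions: the comment above names headless Firefox + mitmproxy but no driver library, so Selenium with geckodriver is my pick, the mitmproxy part is left out, and the function names and delay values are made up for the example.

```python
import random
import time


def polite_delays(count, base_seconds=5.0, jitter_seconds=3.0, rng=random):
    """Yield `count` wait times: a fixed base plus random jitter (values are arbitrary)."""
    for _ in range(count):
        yield base_seconds + rng.uniform(0.0, jitter_seconds)


def fetch_pages(urls, fetch, sleep=time.sleep, **delay_kwargs):
    """Fetch each URL in turn with a single worker, pausing after every request.

    `urls` must be a list; `fetch` is any callable that takes a URL and returns HTML.
    """
    pages = {}
    for url, delay in zip(urls, polite_delays(len(urls), **delay_kwargs)):
        pages[url] = fetch(url)
        sleep(delay)
    return pages


def make_firefox_fetcher():
    """Return (fetch, close) backed by headless Firefox.

    Assumes `pip install selenium` and geckodriver on PATH. Imported lazily so
    the throttling helpers above work without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)

    def fetch(url):
        driver.get(url)
        # Serialized DOM as it stands once the page load has finished.
        return driver.page_source

    return fetch, driver.quit
```

Usage would be roughly `fetch, close = make_firefox_fetcher()`, then `fetch_pages(list_of_urls, fetch)`, then `close()`. Taking `fetch` as a parameter keeps the throttling independent of whichever browser driver you end up using.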

I recently wrote a scraper script that just sent me a Discord ping whenever it ran into a captcha. I'd go look at my laptop, solve it, and then let it keep scraping. I was archiving a comic I had paid for that was locked in a walled-garden app that obviously didn't want you to even THINK of controlling the data you paid for.
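The ping-me-on-captcha loop could be sketched roughly like this in Python. This is not the script described above, just one way it might look: the captcha heuristic, function names, and polling interval are all hypothetical, and the only external assumption is that a Discord webhook accepts a JSON POST with a `content` field.

```python
import json
import time
import urllib.request

# Hypothetical marker list; a real site needs its own check.
CAPTCHA_MARKERS = ("captcha",)


def build_discord_request(webhook_url, message):
    """Build a POST request for a Discord webhook (JSON body with `content`)."""
    body = json.dumps({"content": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Explicit User-Agent, in case the default urllib one is rejected.
            "User-Agent": "captcha-pinger/0.1",
        },
        method="POST",
    )


def notify_discord(webhook_url, message):
    """Send the ping; raises urllib.error.HTTPError on a non-2xx response."""
    with urllib.request.urlopen(build_discord_request(webhook_url, message), timeout=10):
        pass


def looks_like_captcha(html):
    """Crude check: does the page mention any captcha marker?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_until_clear(url, fetch, notify, wait=time.sleep, poll_seconds=30, max_polls=20):
    """Fetch `url`; on a captcha, ping once and re-check until a human has solved it."""
    html = fetch(url)
    if not looks_like_captcha(html):
        return html
    notify(f"Captcha at {url}, go solve it")
    for _ in range(max_polls):
        wait(poll_seconds)
        html = fetch(url)
        if not looks_like_captcha(html):
            return html
    raise TimeoutError(f"captcha at {url} still unsolved after {max_polls} polls")
```

In a real run, `notify` would be something like `lambda msg: notify_discord(WEBHOOK_URL, msg)` with your own webhook URL, and `fetch` would read the page from the browser window you are solving the captcha in.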

wraptile 3 days ago | parent [-]

> fortunately it is now easier than ever to do small-scale scraping, the kind yt-dlp does.

This is absolutely not the case. I've been web scraping since the 2000s. Back then you could just curl any HTML or drive a browser with Selenium for simple automation, but now it's incredibly complex and expensive, even with modern tools like Playwright and all of the monthly "undetectable" flavors of it. Headless browsers are laughably easy to detect because they leak both the fact that they are being automated and the fact that they are headless. Not to mention all of the fingerprinting.

sharpshadow 2 days ago | parent | next [-]

> modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS.

I think he means that running and scraping the JS part is easy now compared to the transition period from basic download scraping to JS execution/headless browser scraping. It is more complex, but a couple of years ago the tools were not as evolved as they are now.

immibis 2 days ago | parent | prev | next [-]

In mozilla-unified/dom/base/Navigator.cpp, find Navigator::Webdriver and make it always return false, then recompile. That is the getter behind navigator.webdriver, so pages no longer see the automation flag.
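A quick way to check what a page actually sees for that flag, sketched in Python. The helper name is hypothetical, and `run_js` stands for whatever your driver offers for running JavaScript in the page (Selenium's `driver.execute_script` is one such callable). Note this only covers the single `navigator.webdriver` signal, not the other fingerprinting mentioned upthread.

```python
def is_flagged_as_automated(run_js):
    """Return True if the page sees navigator.webdriver set to true.

    `run_js` is any callable that runs a JavaScript snippet in the page and
    returns its result, e.g. Selenium's `driver.execute_script`.
    """
    return run_js("return navigator.webdriver") is True
```

On a build patched as described, this should come back False even when the browser is being driven by automation.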

johnisgood 3 days ago | parent | prev [-]

+1

I made a web scraper in Perl a few years ago. It no longer works because I need a headless browser now or whatever it is called these days.

Web scraping is MUCH WORSE TODAY[1].

[1] I am not yelling, just emphasizing. :)