▲ | mrsilencedogood 3 days ago | |||||||||||||||||||||||||
fortunately it is now easier than ever to do small-scale scraping, the kind yt-dlp does. I can literally just go write a script that uses headless firefox + mitmproxy in about an hour or two of fiddling, and as long as I then don't go try to run it from 100 VPS's and scrape their entire website in a huge blast, I can typically archive whatever content I actually care about. Basically no matter what protection mechanisms they have in place. Cloudflare won't detect a headless firefox at low (and by "low" I mean basically anything you could do off your laptop from your home IP) rates, modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS. And obviously at low scale you can just solve captchas yourself. I recently wrote a scraper script that just sent me a discord ping whenever it ran into a captcha, and i'd just go look at my laptop and fix it, and then let it keep scraping. I was archiving a comic I paid for but was in a walled-garden app that obviously didn't want you to even THINK of controlling the data you paid for. | ||||||||||||||||||||||||||
▲ | wraptile 3 days ago | parent [-] | |||||||||||||||||||||||||
> fortunately it is now easier than ever to do small-scale scraping, the kind yt-dlp does. this is absolutely not the case. I've been web scraping since 00s and you could just curl any html or selenium the browser for simple automation but now it's incredibly complex and expensive even with modern tools like playwright and all of the monthly "undetectable" flavors of it. Headless browsers are laughably easy to detect because they leak the fact they are being automated and that they are headless. Not to even mention all of the fingerprinting. | ||||||||||||||||||||||||||
|