| ▲ | wraptile 3 days ago |
| The days of just getting data off the web are coming to an end, as everything now requires a full browser running thousands of lines of obfuscated JS code. So instead of a website giving me that 1 KB of JSON that could be cached, I now start a full browser stack and transmit 10 megabytes across 100 requests, messing up your analytics and security profile, and everyone's a loser. Yay. |
|
| ▲ | nananana9 3 days ago | parent | next [-] |
| On the bright side, that opens an opportunity for 10,000 companies whose only activity is scraping 10MB worth of garbage and providing a sane API for it. Luckily all that is becoming a non-issue, as most content on these websites isn't worth scraping anymore. |
| |
| ▲ | judge2020 3 days ago | parent [-] | | *and whose only customers are using it for AI training | | |
| ▲ | TeMPOraL 3 days ago | parent [-] | | They can afford it because the market rightfully bets on such trained models being more useful than upstream sources. In fact, at this point in time (it won't last), one of the most useful applications of LLMs is to have them deal with all the user-hostile crap that's the bulk of the web today, so you don't have to suffer through it yourself. It's also the easiest way to get any kind of software interoperability at the moment (this will definitely not last long). |
|
|
|
| ▲ | daemin 3 days ago | parent | prev | next [-] |
| This 1 KB of JSON still sounds like a modern thing, where you need to download many MB of JavaScript code to execute and display the 1 KB of JSON data. What you want is to just download the 10-20 KB HTML file, maybe a corresponding CSS file, and any images referenced by the HTML. Then if you want the video, you just get the video file directly. Simple and effective, unless you have something to sell. |
| |
| ▲ | pjc50 3 days ago | parent [-] | | The main reason for doing video through JS in the first place, other than obfuscation, is variable-bitrate support. Oddly enough, some TVs support variable-bitrate HLS natively, as do Apple devices I believe, but regular browsers don't. See https://github.com/video-dev/hls.js/ > unless you have something to sell Video hosting and its moderation are not cheap, sadly, which is why we don't see many competitors. | | |
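For context, the hls.js pattern referenced above looks roughly like the minimal sketch below. The manifest URL and the assumption of a single `<video>` element are placeholders for illustration, not any real site's player code.

```ts
import Hls from "hls.js";

const video = document.querySelector("video") as HTMLVideoElement;
const manifest = "https://example.com/stream/master.m3u8"; // placeholder URL

if (Hls.isSupported()) {
  // Most browsers: hls.js fetches the manifest, feeds segments to Media Source
  // Extensions, and switches quality levels as measured bandwidth changes.
  const hls = new Hls();
  hls.loadSource(manifest);
  hls.attachMedia(video);
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari / Apple devices (and some TVs) play HLS natively; no JS player needed.
  video.src = manifest;
}
```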
| ▲ | a96 2 days ago | parent | next [-] | | P2P services proved long ago that hosting is not a problem. Politics is a problem. What we don't see is more web video services, and services that successfully trick varied content creators into uploading regularly to their platform. https://en.wikipedia.org/wiki/PeerTube also must be mentioned here. | |
| ▲ | Zopieux 3 days ago | parent | prev [-] | | And by "not many" you really mean zero competitors. (Before you ask: Vimeo is getting sold to an enshittification company.) | | |
| ▲ | axiolite 3 days ago | parent [-] | | Those "zero" include: Rumble, Odysee, Dailymotion, Twitch, Facebook Watch, etc. And there's a decent list here: https://ideaexplainers.com/video-sites-like-youtube/ | | |
| ▲ | pjc50 3 days ago | parent | next [-] | | Twitch does live streaming but recently severely limited the extent of free hosting for archived content. Not actually heard of the first two, what's their USP? | |
| ▲ | treyd 3 days ago | parent | prev [-] | | Rumble and Odysee are populated with crazy ragebaiting rightwingers, conspiracy theorists, and pseudo-libertarians. Twitch has the issues the other commenter described, and both Twitch and Facebook are owned by billionaires who are actively collaborating with the current authoritarian regime. Facebook in particular is a risky space for actually exercising free speech and giving coherent critiques of authority. Dailymotion is... maybe okay? As a company it seems like it's on life support. There's a "missing middle" between the corporate, highly produced content that's distributed across all platforms and being a long-tail dumping ground. I did find things like university lectures there, but there aren't creators actually trying to produce content for Dailymotion like there are on YouTube. | | |
| ▲ | axiolite 2 days ago | parent [-] | | > Rumble and Odysee and populated with crazy ragebaiting rightwingers, conspiracy theorists, and pseudo-libertarians. So, just like Youtube, then? | | |
| ▲ | treyd a day ago | parent [-] | | Proportionally speaking, there's a much higher concentration. |
|
|
|
|
|
|
|
| ▲ | xnx 3 days ago | parent | prev | next [-] |
| It's an arms race. Websites have become stupidly/unnecessarily/hostilely complicated, but AI/LLMs have made it possible (though more expensive) to get whatever useful information exists out of them. Soon, LLMs will be able to complete any CAPTCHA a human can within a reasonable time. When that happens, the "analog hole" may be open permanently. If you can point a camera and a microphone at it, the AI will be able to make better sense of it than a person. |
| |
| ▲ | Gigachad 3 days ago | parent | next [-] | | The future will just be every web session gets tied to a real ID and if the service detects you as a bot you just get blocked by ID. | | |
| ▲ | wraptile 3 days ago | parent | next [-] | | > The future will just be every web session gets tied to a real ID This seems like an awful future. We already had this in the form of limited IPv4 addresses, where each IP is basically an identity. People started buying up IP addresses and selling them as proxies. So any other form of ID would suffer the same fate unless enforced at the government level. Worst case scenario, we have 10,000 people sitting in front of screens clicking page links, because hiring someone to use their "government ID" to mindlessly browse the web is the only way to get data off the public web. That's not the future we should want. | |
| ▲ | xnx 3 days ago | parent | prev [-] | | I definitely agree logins will be required for many more sites, but how would the site be able to distinguish humans from bots controlling the browser? Captcha is almost obsolete. ARC AGI is too cumbersome for verifying every time. | | |
| ▲ | Gigachad 3 days ago | parent [-] | | Small-scale usage at the same level as a normal person would probably go under the radar, but if you try scraping, running multiple accounts, or posting any more than a normal user would, it'll be picked up once they can link all actions to a real person. If you are just asking Siri to load a page for you, that probably gets tolerated. Maybe very sensitive sites will go verified-mobile-platform-only, and Apple/Google will provide some kind of AI-free compute environment, like how they can block screen recording or custom ROMs today. Yes, it is 100% the death of the free and open computing environment. But captchas are no longer going to be sufficient. It seems realistic to block bots if you are willing to fully lock down everything. | | |
| ▲ | xnx 3 days ago | parent [-] | | The next frontier is entire fake personas that log in and scrape sites ... which is why government/real-world verification will be required soon. |
|
|
| |
| ▲ | goku12 3 days ago | parent | prev [-] | | Please remember that an LLM accessing any website isn't the problem here. It's the scraping bots that saturate the server bandwidth (a DoS attack of sorts) to collect data to train the LLMs with. An LLM solving a captcha or an Anubis-style proof-of-work problem isn't a big concern here, because the worst it's going to do with the collected data is cache it for later analysis and reporting. Unlike the crawlers, LLMs don't have any incentive to suck up huge amounts of data like a giant vacuum cleaner. | |
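To illustrate the Anubis-style proof-of-work idea mentioned above, here is a minimal hashcash-like sketch, assuming a server-issued challenge string and a difficulty counted in leading zero hex digits. This is a generic illustration, not Anubis's actual protocol.

```ts
import { createHash } from "node:crypto";

// Find a nonce such that sha256(challenge + ":" + nonce) starts with
// `difficulty` zero hex digits. Cheap for one page view, expensive in bulk.
function solveChallenge(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(`${challenge}:${nonce}`).digest("hex");
    if (digest.startsWith(prefix)) return nonce;
  }
}

// The server needs only one hash to verify the submitted nonce, so a legitimate
// visitor pays a little CPU once while a bulk crawler pays it on every request.
console.log(solveChallenge("example-challenge-from-server", 4));
```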
| ▲ | TeMPOraL 3 days ago | parent [-] | | Scraping was a thing before LLMs; there's a whole separate arms race around this for regular competition and "industrial espionage" reasons. I'm not really sure why model training would become a noticeable fraction of scraping activity - there are only a few players on the planet that can afford to train decent LLMs in the first place, and they're not going to re-scrape the content they already have ad infinitum. | |
| ▲ | int_19h 3 days ago | parent [-] | | > they're not going to re-scrape the content they already have That's true for static content, but much of the web is forums and other places like that, where the main value is that new content is constantly generated - and that new content does need to be re-scraped. | |
| ▲ | a96 2 days ago | parent [-] | | If only sites agreed on putting a machine readable URL somewhere that lists all items by date. Like a site summary or a syndication stream. And maybe like a "map" of a static site. It would be so easy to share their updates with other interested systems. | | |
| ▲ | int_19h a day ago | parent [-] | | Why should they agree to make life even easier for people doing something they don't want? |
|
|
|
|
|
|
| ▲ | dpedu 3 days ago | parent | prev | next [-] |
| And it's all to sell more ads. |
|
| ▲ | mrsilencedogood 3 days ago | parent | prev | next [-] |
| Fortunately, it is now easier than ever to do small-scale scraping, the kind yt-dlp does. I can literally just go write a script that uses headless Firefox + mitmproxy in about an hour or two of fiddling, and as long as I then don't go try to run it from 100 VPSes and scrape their entire website in a huge blast, I can typically archive whatever content I actually care about, basically no matter what protection mechanisms they have in place. Cloudflare won't detect a headless Firefox at low rates (and by "low" I mean basically anything you could do off your laptop from your home IP), and modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS. And obviously at low scale you can just solve captchas yourself. I recently wrote a scraper script that sent me a Discord ping whenever it ran into a captcha; I'd just go look at my laptop, fix it, and then let it keep scraping. I was archiving a comic I paid for, but it was in a walled-garden app that obviously didn't want you to even THINK of controlling the data you paid for. |
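As a concrete example of the low-volume scraping described above, here is a minimal sketch using Playwright's bundled Firefox, a stand-in for the headless-Firefox-plus-mitmproxy setup the comment describes. The URL and CSS selector are hypothetical.

```ts
import { firefox } from "playwright";

(async () => {
  // One headless Firefox, one page at a time, at roughly human pacing -
  // no VPS fleet, no parallel blast.
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://example.com/gallery/42", { waitUntil: "networkidle" });

  // Grab whatever the SPA rendered; the selector is site-specific and made up.
  const titles = await page.locator(".episode-title").allTextContents();
  console.log(titles);

  await browser.close();
})();
```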
| |
| ▲ | wraptile 3 days ago | parent [-] | | > Fortunately, it is now easier than ever to do small-scale scraping, the kind yt-dlp does. This is absolutely not the case. I've been web scraping since the 00s, when you could just curl any HTML or drive the browser with Selenium for simple automation, but now it's incredibly complex and expensive even with modern tools like Playwright and all of its monthly "undetectable" flavors. Headless browsers are laughably easy to detect because they leak the fact that they are being automated and that they are headless. And that's not even mentioning all of the fingerprinting. | |
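To give a flavor of the "leak" described above, the sketch below shows an illustrative subset of the client-side signals detection scripts commonly check. It is not any particular vendor's logic, and real fingerprinting goes much further (canvas/WebGL quirks, fonts, timing, TLS fingerprints at the network layer).

```ts
// A few well-known automation tells, checked from page JavaScript.
function looksAutomated(): boolean {
  const signals = [
    navigator.webdriver === true,                    // standard flag set under WebDriver automation
    /\bHeadlessChrome\b/.test(navigator.userAgent),  // older headless Chrome advertised itself in the UA
    navigator.plugins.length === 0,                  // headless profiles often expose no plugins
  ];
  return signals.some(Boolean);
}

console.log(looksAutomated() ? "probably a bot" : "probably a human browser");
```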
| ▲ | sharpshadow 2 days ago | parent | next [-] | | > modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS. I think he means the JS part is now easy to run and scrape, compared to the transition period from basic download scraping to JS-execution/headless-browser scraping. It is more complex, but a couple of years ago the tools weren't as evolved as they are now. | |
| ▲ | 2 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | immibis 2 days ago | parent | prev | next [-] | | mozilla-unified/dom/base/Navigator.cpp - find Navigator::Webdriver and make it always return false, then recompile. | |
| ▲ | johnisgood 3 days ago | parent | prev [-] | | +1 I made a web scraper in Perl a few years ago. It no longer works because I need a headless browser now or whatever it is called these days. Web scraping is MUCH WORSE TODAY[1]. [1] I am not yelling, just emphasizing. :) |
|
|
|
| ▲ | einpoklum 3 days ago | parent | prev | next [-] |
| Those days are not coming to an end: * PeerTube and similar platforms for video streaming of freely-distributable content; * BitTorrent-based mechanisms for sharing large files (or similar protocols). Will this be inconvenient? At first, somewhat. But I am led to believe that in the second category one can already achieve a decent experience. |
| |
| ▲ | dotancohen 3 days ago | parent [-] | | To how many content creators have you written to request that they share their content on PeerTube or BitTorrent? How did they respond? How will they monetize? | | |
| ▲ | einpoklum 3 days ago | parent [-] | | 1. Zero 2. N/A, but enough content creators on YT are very much aware of the kind of prison it is, especially in the years after the Adpocalypse. 3. Obviously, nobody should be able to monetize the copying of content. If it is released, it is publicly released. But they can use Liberapay/Patreon/Buy Me a Coffee, they can sell merch or signed copies of things, they can do live appearances, etc. | |
| ▲ | a96 2 days ago | parent [-] | | 3. They already do, since YT just doesn't really pay and regularly flips out in weird ways. |
|
|
|
|
| ▲ | pjc50 3 days ago | parent | prev | next [-] |
| I think this is just another indication of how the web is a fragile equilibrium in a very adversarial ecosystem. And to some extent, things like yt-dlp and adblocking only work if they're "underground". Once they become popular - or there's a commercial incentive, like AI training - there ends up being a response. |
|
| ▲ | elric 3 days ago | parent | prev | next [-] |
| Not only that, but soon it will require age verification and device attestation. Just in case you're trying to watch something you're not supposed to. |
|
| ▲ | 3 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | bjourne 3 days ago | parent | prev | next [-] |
| For now, yes, but soon Cloudflare and ever more annoying captchas may make that option practically impossible. |
| |
| ▲ | nutjob2 3 days ago | parent [-] | | You should be thankful for the annoying captchas, I hear they're moving to rectal scans soon. |
|
|
| ▲ | pmdr 3 days ago | parent | prev | next [-] |
| > The days of just getting data off the web are coming to an end All thanks to great ideas like downloading the whole internet and feeding it into slop-producing machines that fuel global warming, in an attempt to make said internet obsolete and prop up an industry bubble. The future of the internet is, at best, bleak. Forget about openness. Paywalls, authwalls, captchas and verification scans are here to stay. |
| |
| ▲ | TeMPOraL 3 days ago | parent [-] | | The Internet was turned into a slop warehouse well before LLMs became a thing - in fact, a big part of why ChatGPT et al. have seen such extreme adoption worldwide is that they let people accomplish many tasks without having to inflict the shitfest that is the modern web on themselves. Personally, when it became available, the o3 model in ChatGPT cut my use of web search by more than half, and it wasn't because Google became bad at search (I use Kagi anyway) - it's because even the best results are all shit, or embedded in shit websites, and the less I need to browse through that, the better for me. | | |
| ▲ | pmdr 2 days ago | parent [-] | | > The Internet was turned into a slop warehouse well before LLMs became a thing I suppose that's thanks to Google and their search algos favoring ad-ridden SEO spam. LLMs are indeed more appealing and convenient. But I fear that legitimate websites (ad-supported or otherwise) that actually provide useful information will be on the decline. Let's just hope then that updated information will find its way into LLMs when such websites are gone. | | |
| ▲ | TeMPOraL 2 days ago | parent [-] | | In terms of utility as training data, the Internet is a poisoned well now, and the poison is becoming more potent over time. Part of it is the SEO spam and content marketing slop, both of which kept growing and accumulating. Part of it is even more slop produced by LLMs, especially by cheap (= weak) models, but also by LLMs in general (any LLM used to produce content is doing a worse job at it than a model from a subsequent generation, so it's kinda always suboptimal for training purposes). And now part of it is people mass-producing bullshit out of spite, just to screw with AI companies. The SNR on the web is dropping like a brick falling into a black hole. It's a bit of a gamble at this point - will the larger models, or new architectures, or training protocols, be able to reject all that noise and extract the signal? If yes, then training on the Internet is still safe. If not, it's probably better for them to freeze the datasets blindly scraped from the Internet now, and focus on mining less poisoned sources (like books, academic papers, and other publications not yet ravaged by the marketing communications cancer[0], also ideally published before the last 2 years). I don't know which is more likely - but I'm not dismissing the possibility that the models will be able to process increasingly poisoned data sets just fine, if the data sets are large enough, because of a very basic and powerful idea: self-consistency. True information is always self-consistent, because it reflects the underlying reality. Falsehoods may be consistent in the small, but at scale they're not. | | |
|
|
|
|
| ▲ | 3 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | SV_BubbleTime 3 days ago | parent | prev [-] |
| Do you know what Accelerate means? I want them to go overboard. I want BigTech to go nuts on this stuff. I want broken systems and nonsense. Because that’s the only way we’re going to get anything better. |
| |
| ▲ | jdiff 3 days ago | parent | next [-] | | Accelerationism is a dead-end theory with major holes in its core. Or I should say, "their" core, because there are a million distinct and mutually incompatible varieties. Everyone likes to say "gosh, things are awful, it MUST end in collapse, and after the collapse everyone will see things MY way." They can't all be right. And yet, all of them with their varied ideas still think it'll be a good idea to actively push to make things worse in order to bring on the collapse more quickly. It doesn't work. There aren't any collapses like that to be had. Big change happens incrementally, a bit of refactoring and a few band-aids at a time, and pushing to make things worse doesn't help. | |
| ▲ | exe34 3 days ago | parent | next [-] | | I'm not waiting for the collapse to fix things - I'm waiting for it so that I won't have any more distractions and I can go back to my books. | | |
| ▲ | jdiff 3 days ago | parent [-] | | As I said, there aren't any collapses like that to be had. Heaven and Earth will be moved to make the smallest change necessary to keep things flowing as they were. Banks aren't allowed to fail. Companies, despite lengthy strings of missteps and billions burned on dead ends, still remain on top. You can step away from the world (right now, no waiting required). But the world can remain irrational longer than you can wait for it to step away from you, and pushing for more irrationality won't make a dent in that. | | |
| ▲ | exe34 3 days ago | parent | next [-] | | Oh I think the world will push me away at the next Android update. If I can't root/firewall/adblock/syncthing/koreader, the mobile phone will simply become a phone again. | | |
| ▲ | TeMPOraL 3 days ago | parent [-] | | Ain't that right, eh? It's not the end of the world. Just the end of a whole lot of nice and fun possibilities we've grown to enjoy. |
| |
| ▲ | immibis 2 days ago | parent | prev [-] | | Everything that can't go on forever will eventually stop. On the other hand, the market can remain irrational longer than you can remain solvent. The basic governing principles of the economy were completely rewritten in 1971, were completely rewritten again in 2008, were completely rewritten again in 2020 - probably other times too - and there are only so many more things they can try. The USA is basically running as a pseudo-command economy at the top level now - how long do those typically last? - with big businesses being supported by the central bank. The economy should have collapsed in 1971, 2008 and 2020 (and probably other times) as well, but they kept finding new interventions that would have seemed completely ludicrous 20 years earlier. I mean, the Federal Reserve just buying financial assets? With newly printed money? (It still has a massive reserve of them, this program did not end, that money is still in circulation, and it's propping up a lot of economic numbers.) All predictions about when the musical chairs will end are probably wrong. The prediction that it'll end in the next N years is just as likely to be wrong as the prediction that it won't. Some would argue it already has ended, decades ago, and we are currently living in the financial collapse - how many years of income does it take to get a house now? The collapse of Rome took several centuries. At no point did the people think they were living in a collapsing empire. Each person just thought that how it was in their time was how it always was. |
|
| |
| ▲ | hnfong 3 days ago | parent | prev [-] | | Look at history, things improve and then things get worse, in cycles. During the "things get worse" phase, why not make it shorter? | | |
| ▲ | jancsika 3 days ago | parent | next [-] | | Let's give it a shot. The year is 2003. Svn and cvs are proving to be way too clunky and slow for booming open source development. As an ethical accelerationist, you gain commit access to the repos for svn and cvs and make them slower and less reliable to accelerate progress toward better version control. Lo and behold, you still have to wait until 2005 for git to be released. Because git wasn't written to replace svn or cvs-- it was written as the result of internal kernel politics wrt access to a closed-source source management program, BitKeeper. And since svn and cvs were already bad enough that kernel devs didn't choose them, you making them worse wouldn't have affected their choice. Also, keep in mind that the popularity of git was spurred by tools that converted from svn to git. So by making svn worse, you'd have made adoption of git harder by making it harder for open source devs to write reliable conversion tools. To me, this philosophy looks worse than simply doing nothing at all. And this is in a specific domain where you could at least make a plausible, constrained argument for accelerationism. Your comment instead seems to apply accelerationism to software in general-- there, the odds of you being right are so infinitesimal as to be fatuous. In short, you'd do better playing the lottery, because at least nothing bad happens to anyone else when you lose. | |
| ▲ | TeMPOraL 3 days ago | parent | prev | next [-] | | > During the "things get worse" phase, why not make it shorter? Because it never gets better for the people actually living through it. I imagine those in favor of the idea of accelerating collapse aren't all so purely selfless that they're willing to see themselves and their children suffer and die, all so someone else's descendants can live in a better world. Nah, they just aren't thinking it through. | |
| ▲ | a96 2 days ago | parent | prev | next [-] | | There's no cycle. It's just a long slide with illusionary changes in between. | |
| ▲ | hobs 3 days ago | parent | prev [-] | | It doesn't foreshorten the cycle, it prolongs it and makes it worse. |
|
| |
| ▲ | nananana9 3 days ago | parent | prev [-] | | If you showed me the current state of YouTube 8 years ago - multiple unskippable ads before each video, 5 midrolls for a 10-minute video, comments overrun with bots, video dislikes hidden, the Shorts hell, the dysfunctional algorithm, ... - I would've definitely told you "Yep, that will be enough to kill it!" At this point I don't know - I still have the feeling that "they just need to make it 50% worse again and we'll get a competitor," but I've seen too many of these platforms get 50% worse too many times, and the network effect wins out every time. | |
| ▲ | encom 3 days ago | parent [-] | | It's classic frog boiling. I want them (for whatever definition of "them") to just nuke the frog from orbit. |
|
|