The Company Quietly Funneling Paywalled Articles to AI Developers (theatlantic.com)
30 points by breve 2 days ago | 17 comments
veunes 15 hours ago
So basically Common Crawl is a data laundromat for Big Tech. They outsource their dirty and ethically questionable data collection to a "non-profit," and then act like they're just "researchers" using an "open" dataset. Those "donations" from OpenAI and Anthropic are just payment for plausible deniability.
bookofjoe a day ago

gradientsrneat a day ago
After reading about what Rupert Murdoch did in Australia to try to claw money from search engines for simply indexing pages from news websites, I do understand that it's possible to go too far in favor of the "news" organizations (whether they are reputable or not). I don't think the LLM companies are fully innocent here, to be fair.

bgwalter a day ago
Meanwhile, Common Crawl’s executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. “The robots are people too,” he told me, and should therefore be allowed to “read the books” for free.

The shamelessness of the propaganda reaches new heights. The industry shills no longer even attempt to make arguments; they just rely on people repeating their slogans.

superkuh a day ago
>“You shouldn’t have put your content on the internet if you didn’t want it to be on the internet,”

This is absolutely the correct and original take. This modern corporate bending over backwards to try to appease the lawyers and pretend the web isn't public is the new and weird take. Seriously, if it's not supposed to be public then don't put it in public. When I send an HTTP request to a webserver on the public internet, it is up to them to decide if they want to respond to that request. And it is 100% up to me what I do with the data in the response on my machine in private.

>Common Crawl doesn’t log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you’re a subscriber and hides the content if you’re not.

This weaselly idea above, that corporations get to decide how you display the HTML, is very, very dangerous to our society. It's as if visiting a website and downloading its publicly available contents lets a nation set up an embassy of "foreign soil" on your hardware, one that they control and you don't. Their cultural expectation is that you cannot do what you want with that data. Modifying it, or how it's displayed, is to them like walking into their business location and moving around the displays. So obviously the only legal interface is the one they provide "at their location" or via another incorporated entity they associate with. But of course they aren't at their location; they're at my location, on my property, in my PC. But slowly this commercial norm is working its way into legislation to become our reality as web attestation.

What they see, and what they want, is a situation equal to you going to their business premises and sitting down at one of their machines. They want to own your computer in just the same way, simply by you visiting a website. That shit's fucked. I'll turn off CSS and JS if I want to and read the text if I want to, on my computer, in my RAM. If you don't want me doing that, don't respond to the HTTP request. And stop trying to characterize all interactions on the web as between corporations. There are more of us human people than corporate people. Our use cases matter.

Alex Reisner and The Atlantic should be ashamed of themselves. They obviously don't know what they're talking about and are just repeating a corporate PR line, or, at best, intentionally trying to create controversy out of nothing.
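For anyone unsure what the quoted paywall mechanism amounts to, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `PAGE` markup, the `isSubscriber` flag, and the `TextOnly` and `extract_text` names are hypothetical and are not taken from any real publisher or from Common Crawl's crawler. It only shows the general point that when the server sends the full text and relies on a script to hide it, a client that never runs the script still has the text.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends the full article to everyone,
# then relies on a script to remove it for non-subscribers.
PAGE = """
<html><body>
<article id="story"><p>Full article text sent to every visitor.</p></article>
<script>
  // Hypothetical client-side paywall: only runs if the client executes JS.
  if (!window.isSubscriber) document.getElementById("story").remove();
</script>
</body></html>
"""


class TextOnly(HTMLParser):
    """Collects text nodes and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside script/style elements
        self.chunks = []    # text fragments found outside those elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text as a client that never executes scripts sees it."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(extract_text(PAGE))
```

Nothing here logs in or bypasses a server-side check; the script is parsed as inert text and dropped, which is the same thing that happens when you browse with JS turned off.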