| ▲ | SlinkyOnStairs 4 hours ago | |
Devil's advocate: Anyone seeking to limit AI scraping doesn't have much choice but to block archivists as well. And it's genuinely not that weird for news organisations to want to stop AI scraping. This is just a repeat of their fight with social media embedding.

Sure, the back catalogue should be as close to public domain as possible; libraries keeping those records is incredibly important for research. But with current news it gets complicated, as taking the articles without paying the subscription (or viewing the ads) directly takes away the revenue streams that newsrooms rely on to produce the news.

Hence the "Newspaper trying to ban linking" mess, which was never about the links themselves but about social media sites embedding the headline and a snippet, which in turn made all the users stop clicking through and "paying" for the article.

Social media relies on those newsrooms (and really, on most other kinds of websites) to provide a lot of its content. And AI relies on them for all of the training data (remember: "synthetic data" does not appear ex nihilo) and to provide the news that AI users request.

We can't just let the newsrooms die. The newsroom itself hasn't been replaced; its revenue has been destroyed.

---

And so the question of archives pops up. Because yes, you can, with some difficulty, block out the AI bots, even the social media bots. A paywall suffices. But this kills archiving. Yet if you whitelist the archives in some way, the AI scrapers will just pull their data out of the archive instead and the newsrooms still die. (Which also makes the archiving moot.)

A compromise solution might be for archives to accept/publish things on a delay: keep the AI companies from taking the current news without paying up, while still granting everyone access to stuff from decades ago. There's just major disagreement about what a reasonable delay is.
Most major news orgs and other such IP-holders are pretty upset about AI firms' "steal first, ask permission later" approach. Several AI firms setting the standard that training data is to be paid for doesn't help here either. In paying for training data they've created a significant market for archives, and a significant incentive not to make them freely accessible to the public. Why would The Times ever hand over their catalogue to the Internet Archive if Amazon will pay them a significant sum of money for it? The greater good of all humanity? Good luck getting that from a dying industry.

---

Tangent: Another annoying wrinkle in the financial incentives here is that not all archiving organisations are engaging in fair play, which pushes people yet further to obstruct their work.

To cite an HN-relevant example: Source code archivist "Software Heritage" has long held a copy of all the source code they can get their hands on, regardless of its license. If it's ever been on GitHub, odds are they're distributing it. Even when licenses explicitly forbid that. (This is, of course, perfectly legal in the case of actual research and other fair use. But:) They were notably involved in HuggingFace's "The Stack" project by sharing their archives ... and received money from HuggingFace. While the latter is nominally a donation, this is in effect a sale.

---

I find it quite displeasing that the EFF fails to identify the incentives at play here. Simply trying to nag everyone into "doing the thing for the greater good!" is loathsome and doesn't work. Unless we change this incentive structure, the outcome won't change. | ||
| ▲ | Obscurity4340 3 hours ago | |
It would be better if there were some arrangement the papers could reach with the Archive where they just delay the release, or wait a week, and then it's part of the archive. That way, news gets paid for when it's hot and fresh, but then it gets archived and the record is preserved. | ||
| ▲ | onetokeoverthe 4 hours ago | parent | prev [-] | |
[dead] | ||