bpodgursky 2 days ago

The use of paywalled scientific articles to train AI is one place where I think we have to just draw the line and say, this has to be allowed or US AI is simply going to get gutted and replaced by international competitors who have no respect for copyright law.

Sorry but this is just a competitive reality and the content matters A LOT. Sucks that Elsevier gambled badly on the scientific community putting up with overpriced subscriptions forever, but their concerns can't dictate national policy on this.

arjie 2 days ago | parent | next [-]

Absolutely agree. Realistically, everyone was playing around with this thing because everyone was using Sci Hub, /r/Scholar, and god knows what else to get PDFs. This is one of those things where the reality is well-known and people pretend that something is actually going on in copyright enforcement.

And if I'm being honest, I'm tired of the International Brotherhood of Stevedores[0] style of shredding human productivity to protect some special interest group. If Elsevier died tomorrow, we'd lose a curation function to scientific papers, true, but we wouldn't lose the science itself. And while the curation on scientific output is clearly valuable - China is suffering the lack of this while producing prodigious science - I think it's far less important than the scientific output itself. This is especially true of US science.

0: IBS, the AMA, pharmacists, teacher unions, firefighter unions, tax preparers: the distributed cost to society is huge because we decided on protecting these special interest groups. Blocking AI would be a bridge too far.

pjc50 a day ago | parent | prev | next [-]

So you end up paying an AI company (or subsist on not-endless free tokens) to circumvent another company's paywall? This doesn't sound like a sustainable solution.

How reliable is it? Can you just ask an AI for a DOI and get a reasonably correct copy of the original article back? Is the level of hallucination induced in science acceptable?

Ferret7446 2 days ago | parent | prev | next [-]

I think this is one reason "piracy for AI" in general is tolerated. Anyone with a clear understanding of real world dynamics realizes that if a foreign country that lacks scruples develops "AGI", for lack of a better term, then you're fucked. This is in a sense a nuclear arms race.

The same applies between companies, by the way, hence the "AI bubble".

The other reason "piracy for AI" is tolerated is because it's not at all clear how to legislate or regulate it. You might think it's a cut and dry case, but lots of other people think the same about the opposite conclusion.

kmeisthax 2 days ago | parent | prev [-]

I agree, but only in the sense that I think any amount of copyright protection for scientific papers is absolutely absurd. The creativity involved in papers is minimal and a good chunk of that research is funded by the government, so paywalling it is criminally unethical.

Also, if we're going to bin the entire concept of copyright, can we at least be equal about it? I'd rather not live in a world where humans labor for the remnants of their culture in the content mines while clankers[0] feast on an endless stream of training data.

[0] Fake racial slur for robots or other AI systems.

zzo38computer 2 days ago | parent [-]

I agree. I think that copyright should be abolished entirely, especially for scientific articles (if they are good quality scientific research then I think they would be too important to be copyrighted, in addition to the other stuff you mention), but also for anything else too.

Nevertheless I think there is another thing against the LLM training, which is that the scraping seems to be excessive (although it could be made less excessive; there are many ways to help with making it less excessive) and I think it requires too much power (although I don't really know a lot about it).

These are two separate issues, though.

jruohonen 2 days ago | parent [-]

> I think that copyright should be abolished entirely, especially for scientific articles

You know, it is really the CC-BY style that most science people care about. The same goes for MIT/BSD open source licenses, while GPL, I suppose, is on the side of CC-BY-SA.