| ▲ | jakelazaroff 4 hours ago |
| To someone who believes that AI training data is built on the theft of people's labor, your second paragraph might sound like an 1800s plantation owner saying "can you imagine trying to explain to someone 100 years from now we tried to stop slavery because of civil rights". You're not addressing their point at all, just waving it away. |
|
| ▲ | refulgentis 4 hours ago | parent | next [-] |
| I appreciate a good debate. However, this won’t fit in one. It is tasteless, offensive, and stupid to compare storing the result of an HTTP GET without paying someone to slavery in the 1800s. Full stop. Anyone tempted to double down on this: sure, maybe, someday it’s like The Matrix or whatever. I was 12 when it came out & understood that was a fictional extreme. You do too. And you stumbled into a better analogy than slavery in the 1800s. |
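For concreteness, a minimal sketch of what "storing the result of an HTTP GET" literally denotes, using only Python's standard library; the URL and filename are hypothetical placeholders:

```python
# A minimal sketch of "storing the result of an HTTP GET".
# The URL and filename below are hypothetical placeholders.
import urllib.request

with urllib.request.urlopen("https://example.com/article.html") as resp:
    data = resp.read()  # fetch the bytes the server publicly serves
with open("article.html", "wb") as f:
    f.write(data)       # store them locally, e.g. in a training corpus
```

The dispute in the rest of the thread is not over this operation itself but over what may lawfully be done with the stored bytes afterward.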
| |
| ▲ | mmooss 4 hours ago | parent | next [-] | | You're changing the subject. What about the actual point? | | |
| ▲ | refulgentis 3 hours ago | parent [-] | | [flagged] | | |
| ▲ | beeflet 3 hours ago | parent [-] | | Change the law so you can't train on copyrighted work without permission from the copyright holder. >harassed This just in, anonymous forum user SHOCKINGLY HARASSED, PELTED with HIGH-SPEED ideas and arguments, his positions BRUTALLY ATTACKED and PUBLICLY DEFACED. | | |
| ▲ | anoncareer0212 2 hours ago | parent [-] | | Been here for many years and haven’t seen behavior as boorish as this, especially from a self-appointed debate club president. The post you’re replying to: which is what? I’m honestly unsure. Could be: we need to nuke the data centers, or unseat any judge that has allowed this, or somehow move the law away from “it’s cool to do matmuls with text as long as you have the right to read it.” Not against any of those, but I’m sure I’m Other Team coded to you, given the amount of harassment you’ve done in this thread to me and others. |
|
|
| |
| ▲ | jakelazaroff 4 hours ago | parent | prev | next [-] | | I mean, yeah, if you omit any objectionable detail and describe it in the most generic possible terms, then of course the comparison sounds tasteless and offensive. Consider that collecting child pornography is also "storing the result of an HTTP GET". | |
| ▲ | refulgentis 3 hours ago | parent | next [-] | | What was the objectionable detail I forgot to include? Feeding the HTTP GET result to an AI? Then it’s the same as slavery? Sounds clearly wrong to me. | | |
| ▲ | jakelazaroff 3 hours ago | parent [-] | | No, I pointed out that your attempt to straw man my comment was so overly broad that it also describes collecting child pornography. Why not engage specifically with what I'm saying? | | |
| ▲ | anoncareer0212 3 hours ago | parent [-] | | What didn’t they engage with? It’s really hard to parse this thread because you and the other gentleman keep telling anyone who engages that they aren’t engaging. You both seem worked up, perceiving others as disagreeing with you wholesale on the very concept that AI companies could be forced to compensate people for training data, and as morally injuring you. Your conduct to a point, but especially theirs, goes far beyond what I’m used to on HN. I humbly suggest you decouple yourself a bit from them; you really did go too far with the slavery bit, and it was boorish to then make the child porn analogy. | |
| ▲ | jakelazaroff 3 hours ago | parent | next [-] | | If you believe my conduct here is inappropriate, feel free to alert the mods. I think it's pretty obvious why describing someone's objections to AI training data as "storing the result of an HTTP GET" is not a good faith engagement. | | |
| ▲ | anoncareer0212 2 hours ago | parent [-] | | It’s not clear from anything either of you have written what the difference between “AI training data” and “storing the result of an HTTP GET [and matmul’ing it]” is. All we have is an exquisite, thoughtful, nuanced analogy of how it is exactly like America enslaving Black people in the 1800s, i.e. a cheap appeal to morality. That is then followed by repeated brow-beating comments to anyone who replied, complaining that something wasn’t being engaged with. What exactly wasn’t being engaged with? It is still unclear. Do feel free to share, or even apologize. It’s understandable that you went a bit too far because you really do feel it’s the same as slavers in 1800s America; what’s not understandable is complaining that no one is engaging correctly. |
| |
| ▲ | 3 hours ago | parent | prev [-] | | [deleted] |
|
|
| |
| ▲ | ronsor 3 hours ago | parent | prev [-] | | The objection to CSAM is rooted in how it is (inhumanely) produced; people are not merely objecting to a GET request. | | |
| ▲ | beeflet 3 hours ago | parent | next [-] | | Yes, they're objecting to people training on data they don't have the right to, not just the GET request as you suggest. If you distribute child porn, that is a crime. But if you crawl every image on the web and then train a model that can synthesize child porn, the current legal model apparently has no concept of this and treats it completely differently. Generally, I am more interested in how this affects copyright. These AI companies just have free rein to convert copyrighted works into the public domain through the proxy of over-trained AI models. If you release something as GPL, they can strip the license, but the same is not true of closed-source code, which isn't trained on. | |
| ▲ | jakelazaroff 3 hours ago | parent | prev [-] | | Indeed, and neither is that what people are objecting to with regard to AI training data. |
|
| |
| ▲ | 3 hours ago | parent | prev [-] | | [deleted] |
|
|
| ▲ | anonym29 3 hours ago | parent | prev | next [-] |
| The difference is that people who write open source code or release art publicly on the internet from their comfortable, air-conditioned offices voluntarily chose to give away their work for free, while slaves were coerced to perform grueling, brutal physical labor in horrific conditions against their will at gunpoint. Basically the exact same thing. |
| |
| ▲ | sirwhinesalot 3 hours ago | parent | next [-] | | It's not free. There is a license attached. One you are supposed to follow, and not doing so is against the law. | |
| ▲ | anonym29 3 hours ago | parent [-] | | There's a deeper discussion here about property rights, about shrinkwrap licensing, about the difference between "learning from" vs "copying", about the realpolitik of software licensing agreements, about how, if you actually wanted to protect your intellectual property (stated preference), you might be expected to make your software proprietary and not deliberately distribute instructions on how to reproduce an exact replica of it in order to benefit from the network effects of open distribution (revealed preference) - about wanting to have your cake and eat it too, but I'd be remiss to not point out that your username is not doing your credibility any favors here. | | |
| ▲ | sirwhinesalot 3 hours ago | parent | next [-] | | I'm not whining in this case, just pointing out "they gave it out for free" is completely false, at the very least for the GNU types. It was always meant to come with plenty of strings attached, and when those strings were dodged new strings were added (GPL3, AGPL). If I had a photographic memory and I used it to replicate parts of GPLed software verbatim while erasing the license, I could not excuse it in court that I simply "learned from" the examples. Some companies outright bar their employees from reading GPLed code because they see it as too high of a liability. But if a computer does it, then suddenly it is a-ok. Apparently according to the courts too. If you're going to allow copyright laundering, at least allow it for both humans and computers. It's only fair. | | |
| ▲ | shkkmo 2 hours ago | parent [-] | | > If I had a photographic memory and I used it to replicate parts of GPLed software verbatim while erasing the license, I could not excuse it in court that I simply "learned from" the examples. Right, because you would have done more than learn: you would have gone past learning and used that learning to reproduce the work. It works exactly the same for an LLM. Training the model on content you have legal access to is fine. Afterwards, someone using that model to produce a replica of that content is engaged in copyright infringement. You seem set on conflating the act of learning with the act of reproduction. You are allowed to learn from copyrighted works you have legal access to; you just aren't allowed to duplicate those works. | |
| ▲ | sirwhinesalot 2 hours ago | parent | next [-] | | The problem is that it's not the user of the LLM doing the reproduction, the LLM provider is. The tokens the LLM is spitting out are coming from the LLM provider. It is the provider that is reproducing the code. If someone hires me to write some code, and I give them GPLed code (without telling them it is GPLed), I'm the one who broke the license, not them. | | |
| ▲ | shkkmo 2 hours ago | parent [-] | | > The problem is that it's not the user of the LLM doing the reproduction, the LLM provider is. I don't think this is legally true. The law isn't fully settled here, but things seem to be moving towards the LLM user being the holder of the copyright of any work produced by that user prompting the LLM. It seems like this would also place the infringement onus on the user, not the provider. > If someone hires me to write some code, and I give them GPLed code (without telling them it is GPLed), I'm the one who broke the license, not them. If you produce code using an LLM, you (probably) own the copyright. If that code is already GPL'd, you would be the one engaged in infringement. |
| |
| ▲ | zephen 2 hours ago | parent | prev [-] | | You seem set on conflating "training" an LLM with "learning" by a human. LLMs don't "learn" but they _do_ in some cases, faithfully regurgitate what they have been trained on. Legally, we call that "making a copy." But don't take my word for it. There are plenty of lawsuits for you to follow on this subject. | | |
| ▲ | shkkmo 2 hours ago | parent [-] | | > You seem set on conflating "training" an LLM with "learning" by a human. "Learning" is an established word for this; happy to stick with "training" if that helps your comprehension. > LLMs don't "learn" but they _do_ in some cases, faithfully regurgitate what they have been trained on. > Legally, we call that "making a copy." Yes, when you use an LLM to make a copy... that is making a copy. When you train an LLM... that isn't making a copy, that is training. No copy is created until output is generated that contains a copy. | |
| ▲ | zephen an hour ago | parent [-] | | > "Learning" is an established word for this Only by people attempting to muddy the waters. > happy to stick with "training" if that helps your comprehension. And supercilious dickheads (though that is often redundant). > No copy is created until output is generated that contains a copy. The copy exists, albeit not in human-discernible form, inside the LLM, else it could not be generated on demand. Despite your claiming that "It works exactly the same for an LLM," no, it doesn't. |
|
|
|
| |
| ▲ | michaelsshaw 3 hours ago | parent | prev [-] | | We spread free software for multiple purposes, one of them being the free software ethos. People using that for training proprietary models is antithetical to such ideas. It's also an interesting double standard, wherein if I were to steal OpenAI's models, no AI worshippers would have any issue condemning my action, but when a large company clearly violates the license terms of free software, you give them a pass. | | |
| ▲ | ronsor 3 hours ago | parent | next [-] | | > I were to steal OpenAI's models, no AI worshippers would have any issue condemning my action If GPT-5 were "open sourced", I don't think the vast majority of AI users would seriously object. | | |
| ▲ | sirwhinesalot 3 hours ago | parent [-] | | OpenAI got really pissy about DeepSeek using other LLMs to train though. Which is funny since that's a much clearer case of "learning from" than outright compressing all open source code into a giant pile of weights by learning a low-dimensional probability distribution of token sequences. |
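As an aside for readers, here is a toy illustration of what "learning a probability distribution of token sequences" means in the simplest possible case: a bigram model that counts which token follows which. This is a deliberately tiny sketch, not how any production LLM works; real models use neural networks over far longer contexts.

```python
# Toy bigram model: tally which token follows which, then normalize the
# tallies into next-token probabilities. A minimal sketch of "learning a
# probability distribution of token sequences", nothing more.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # how often `nxt` follows `prev`

probs = {prev: {word: n / sum(nxts.values()) for word, n in nxts.items()}
         for prev, nxts in counts.items()}

print(probs["the"])  # {'cat': 0.666..., 'mat': 0.333...}
```

Whether scaling this idea up by many orders of magnitude amounts to "learning from" or "compressing" the training text is exactly the disagreement in this subthread.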
| |
| ▲ | anonym29 an hour ago | parent | prev [-] | | I can't speak for anyone else, but if you were to leak weights for OpenAI's frontier models, I'd offer to hug you and donate money to you. Information wants to be free. |
|
|
| |
| ▲ | jakelazaroff 3 hours ago | parent | prev | next [-] | | > The difference is that people who write open source code or release art publicly on the internet from their comfortable air conditioned offices voluntarily chose to give away their work for free That is not nearly the extent of AI training data (e.g. OpenAI training its image models on Studio Ghibli art). But if by "gave their work away for free" you mean "allowed others to make [proprietary] derivative works", then that is in many cases simply not true (e.g. GPL software, or artists who publish work protected by copyright). | |
| ▲ | grandinquistor 3 hours ago | parent | prev | next [-] | | What? Over 183K books were pirated by these big tech companies to train their models. They knew what they were doing was wrong. | |
| ▲ | michaelsshaw 3 hours ago | parent | prev [-] | | Perhaps you should Google the definition of metaphor before commenting. |
|
|
| ▲ | ronsor 4 hours ago | parent | prev [-] |
| > believes that AI training data is built on the theft of people's labor I mean, this is an ideological point. It's not based in reason, won't be changed by reason, and is really only a signal to end the engagement with the other party. There's no way to address the point other than agreeing with them, which doesn't make for much of a debate. > an 1800s plantation owner saying "can you imagine trying to explain to someone 100 years from now we tried to stop slavery because of civil rights" I understand this is just an analogy, but for others: people who genuinely compare AI training data to slavery will have their opinions discarded immediately. |
| |
| ▲ | zaptheimpaler 3 hours ago | parent | next [-] | | We have clear evidence that millions of copyrighted books have been used as training data because LLMs can reproduce sections from them verbatim (and emails from employees literally admitting to scraping the data). We have evidence of LLMs reproducing code from GitHub that was never released under a license that would permit such use. We know this is illegal. What about any of this is ideological and unreasonable? It's a CRYSTAL CLEAR violation of the law and everyone just shrugs it off because technology or some shit. | |
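The verbatim-reproduction claim above is at least mechanically testable. Here is a minimal sketch of the usual check: look for long word-level n-grams shared between a model's output and a source text. The 8-word threshold and the example strings are illustrative assumptions, not any party's actual test methodology.

```python
# Minimal sketch: report word-level n-grams that appear verbatim in both a
# model's output and a source text. Threshold and examples are assumptions.
def shared_ngrams(output: str, source: str, n: int = 8) -> set[str]:
    ow, sw = output.split(), source.split()
    out_grams = {" ".join(ow[i:i + n]) for i in range(len(ow) - n + 1)}
    src_grams = {" ".join(sw[i:i + n]) for i in range(len(sw) - n + 1)}
    return out_grams & src_grams  # any hit is an n-word span copied verbatim

model_output = "it was the best of times it was the worst of times indeed"
source_text = "he wrote it was the best of times it was the worst of times and more"
print(shared_ngrams(model_output, source_text))  # non-empty: overlap found
```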
| ▲ | ReflectedImage 2 hours ago | parent | next [-] | | All creative types train on other creatives' work. People don't create award-winning novels or art pieces from scratch. They steal ideas and concepts from other people's work. The idea that they are coming up with all this stuff from scratch is Public Relations bs. Like Arnold Schwarzenegger never taking steroids, only believable if you know nothing about bodybuilding. | |
| ▲ | oreally 2 hours ago | parent | next [-] | | Precisely. Nothing is truly original. To talk as though there's an abstract ownership over even an observation of a thing, one that forces people to pay rent to use it... well, artists definitely don't pay whoever invented perspective drawing, and programmers don't pay the programming language's creator. People don't pay Newton and his descendants for anything that makes use of gravity. Copyright has always been counterproductive in many ways. To go into details though, copyright law has a "fair use" defense with a "transformative" criterion. This allows things like satire and reaction videos to exist. So long as you don't replicate the original 1-to-1 in product and purpose, IMO it qualifies as tasteful use. | |
| ▲ | zaptheimpaler an hour ago | parent | prev [-] | | What the fuck? People also need to pay to access that creative work if the rights owner charges for it, and they are also committing an illegal act if they don't. The LLM makers are doing this illegal act billions of times over for something approximating all creative work in existence. I'm not arguing that creatives make things in a vacuum; this is completely beside the point. |
| |
| ▲ | shkkmo 2 hours ago | parent | prev | next [-] | | You keep conflating different things. > We have evidence of LLMs reproducing code from GitHub that was never released under a license that would permit such use. We know this is illegal. What is illegal about it? You are allowed to read and learn from publicly available unlicensed code. If you use that learning to produce a copy of those works, that is infringement. Meta clearly engaged in copyright infringement when they torrented books that they hadn't purchased. That was infringement already, before they started training on the data. That doesn't make the training itself infringement, though. | |
| ▲ | zaptheimpaler an hour ago | parent [-] | | > Meta clearly engaged in copyright infringement when they torrented books that they hadn't purchased. That was infringement already, before they started training on the data. That doesn't make the training itself infringement, though. What kind of bullshit argument is this? Really? Works created using illegally obtained copyrighted material are themselves considered infringing as well. It's called derivative infringement. This is both common sense and law. Even if not, you agree that they infringed on the copyright of something close to all copyrighted works on the internet, and this sounds fine to you? The consequences and fines from that would kill any company if they actually had to face them. |
| |
| ▲ | Alex2037 3 hours ago | parent | prev [-] | | >We know this is illegal >It's a CRYSTAL CLEAR violation of the law in the court of reddit's public opinion, perhaps. there is, as far as I can tell, no definitive ruling about whether training is a copyright violation. and even if there were, US law is not global law. China, notably, doesn't give a flying fuck. kill American AI companies and you will hand the market over to China. that is why "everyone just shrugs it off". | |
| ▲ | zaptheimpaler an hour ago | parent | next [-] | | China is doing human gene editing and embryo cloning too; we should get right on that. They're harvesting organs from a captive population too; we should do that as well, otherwise we might fall behind on transplants & all the money & science involved with that. Lots of countries have drafts and mandatory military service too. This is the zero-morality Darwinian view: all is fair in competition. In this view, any stealing that China or anyone else does is perfectly fine too, because they too need to compete with the US. |
| ▲ | goatlover 2 hours ago | parent | prev [-] | | The "China will win the AI race if we in the West (America) don't" line is an excuse created by those who started the race in Silicon Valley. It's like America saying it had to win the nuclear arms race, when physicists like Oppenheimer back in the late 1940s wanted to prevent one once they understood the consequences. | |
| ▲ | Alex2037 2 hours ago | parent [-] | | okay, and? what do you picture happening if Western AI companies cease to operate tomorrow and fire all their researchers and engineers? |
|
|
| |
| ▲ | mmooss 4 hours ago | parent | prev | next [-] | | It's very much based on reason and law. > There's no way to address the point That's you quitting the discussion and refusing to engage, not them. > have their opinions discarded immediately. You dismiss people who disagree and quit twice in one comment. | | |
| ▲ | tombert 3 hours ago | parent | next [-] | | > It's very much based on reason and law. I have no interest in the rest of this argument, but I think I take a bit of issue with this particular point. I don't think the law is fully settled on this in any jurisdiction, and it certainly isn't in the United States. "Reason" is a more nebulous term; I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive who wasn't at least partially inspired by the works of Disney, but I don't think that implies that somehow all animations are "stolen" from Disney just because of that fact. Where you draw the line on this is obviously subjective, and I've gone back and forth, but I find it really annoying that everyone is acting like this is so clear-cut. Evil corporations like Disney have been trying to use this logic for decades to try to abuse copyright and outlaw being inspired by anything. | |
| ▲ | mmooss 3 hours ago | parent [-] | | It can be based on reason and law without being clear cut - that situation applies to most of reason and law. > I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive that wasn't at least partially inspired by the works of Disney ... Sure, but you can reason about it, such as by using analogies. |
| |
| ▲ | refulgentis 4 hours ago | parent | prev [-] | | [flagged] |
| |
| ▲ | beepbooptheory 3 hours ago | parent | prev | next [-] | | What makes something more or less ideological for you in this context? Is "reason" always opposed to ideology for you? What is the ideology at play here for the critics? | |
| ▲ | zwnow 4 hours ago | parent | prev [-] | | > I mean, this is an ideological point. It's not based in reason You cant be serious |
|