▲ | PhantomHour 2 days ago
> While this likely has no legal weight

I wouldn't be quite so sure about that. The AI industry has relied entirely on "move fast and break things" and "old fart judges who don't understand the tech" as its legal strategy.

The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take Reddit's data, it's not fair use to take mine either.

On a technological level the difference from prior ML is straightforward: a classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees that it produces new information derived from the training data rather than the training data itself.

LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but they have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus of copyright violation.

Add in how GenAI tends to compete directly with the market for the works used as training data, in ways that prior "fair use" systems did not, and things become sketchy quickly.

Every major AI company knows this: they all rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution to curtailing copyright infringement by AI.)

Observe how all the major copyright victories amount to judges dismissing cases on grounds of "well, you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
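The architectural contrast can be sketched in toy Python (illustrative code only, not any real system): a classifier's output space is a fixed label set, so no sequence of its outputs can ever reproduce its training text, while a generative model's output space is the token vocabulary itself, the same space the training text lives in.

```python
# Toy contrast of output spaces (hypothetical models, not any real system).
LABELS = ["spam", "not_spam"]        # a classifier can only ever emit these
VOCAB = ["the", "cat", "sat", "on"]  # a generator emits training-vocabulary tokens

def classify(text: str) -> str:
    # Whatever mapping it learned, every possible output is one of two labels,
    # so the training text can never come back out.
    return LABELS[len(text) % len(LABELS)]

def generate_step(context: list[str]) -> str:
    # Every possible output lives in the same space as the training data,
    # so a memorized passage can, in principle, be emitted verbatim.
    return VOCAB[len(context) % len(VOCAB)]

print(classify("the cat sat on"))    # a label, never the input text
print(generate_step(["the", "cat"]))  # a token from the data's own space
```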
▲ | visarga 2 days ago | parent | next [-]
> but have no architectural mechanism to separate facts from expressions

Sure they do. Every time a bot searches, reads your site, and formulates an answer, it does not replicate your expression. First, it compares across 20..100 sources. Second, it only reports what is related to the user query. And third, it uses its own expression. It's more like asking a friend who read those articles and getting an answer.

LLMs' ability to separate facts from expression is quite well developed; it's maybe their strongest skill. They can translate, paraphrase, summarize, or reword forever.
▲ | orangecat 2 days ago | parent | prev | next [-]
> 'old fart judges who don't understand the tech'

If this is intended to refer to Judge Alsup, it is extremely wrong.
▲ | HarHarVeryFunny 2 days ago | parent | prev | next [-]
> The idea that AI training is fair use isn't so obvious

> Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.

Well, all a judge can or should do is apply current law to the case before them. In the case of generative AI it seems that it's mostly going to be copyright and "right of publicity" (reproducing someone else's likeness or voice) that apply.

Copyright infringement is all about having published something based on someone else's work. AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. by training an AI) if they haven't actually done it. It has to be about the generated artifact.

Of course copyright law wasn't designed with generative AI in mind, and maybe now that it's here we need new laws to protect creative content. For example, should OpenAI be able to copy Studio Ghibli's "trademark" style without requiring permission?
▲ | janalsncm 2 days ago | parent | prev [-]
> The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself

A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before. In fact a “decoder” is simply autoregressive token classification.
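That last point can be sketched in toy Python (a hypothetical lookup-table "model", not a real network): greedy decoding is just repeated token classification over the vocabulary, and a model that has memorized a training sequence will classify its way through it verbatim.

```python
import numpy as np

# Toy illustration: each "decoder" step is a classifier over the vocabulary.
# The hypothetical model below has perfectly memorized one training sequence,
# showing that an autoregressive classifier can regurgitate training data.
VOCAB = ["<s>", "the", "cat", "sat", "</s>"]
MEMORIZED = {
    ("<s>",): "the",
    ("<s>", "the"): "cat",
    ("<s>", "the", "cat"): "sat",
    ("<s>", "the", "cat", "sat"): "</s>",
}

def classify_next(context):
    """One classification step: score every vocab token, pick the argmax."""
    logits = np.full(len(VOCAB), -1e9)
    logits[VOCAB.index(MEMORIZED[tuple(context)])] = 0.0  # memorized peak
    return VOCAB[int(np.argmax(logits))]

def decode(max_len=10):
    """Autoregressive loop: decoding is repeated token classification."""
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        seq.append(classify_next(seq))
    return seq

print(decode())  # emits the memorized training sequence token by token
```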