slyall 7 months ago

The problem is that the anti-AI people who complain about AI are going after several steps in the chain (and they are often vague about which ones they are talking about at any point).

As well as the "copying" of content, some are also claiming that the output of an LLM should result in royalties being paid back to the owners of the material used in training.

So if an AI produces a sitcom script, then the copyright holders of the TV shows it ingested should get paid royalties. In addition to the money paid to copy files around.

Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.

jashmatthews 7 months ago | parent | next [-]

When humans learn and copy too closely we call that plagiarism. If an LLM does it how should we deal with that?

chii 7 months ago | parent [-]

> If an LLM does it how should we deal with that?

why not deal with it the same way as humans have been dealt with in the past?

If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.

Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation.

dijksterhuis 7 months ago | parent [-]

this isn’t the same.

> If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.

the COPYing is happening on your local machine with non-cloud versions of Photoshop.

you are making a copy, using a tool, and then distributing that copy.

in music royalty terms, the making a copy is the Mechanical right, while distributing the copy is the Performing right.

and you are liable in this case.

> Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation

OpenAI make a copy of the original works to create training data.

when the original works are reproduced verbatim (memorisation in LLMs is a thing), then that is the copyrighted work being distributed.

mechanical and performing rights, again.

but the twist is that ChatGPT does the copying on their servers and delivers it to your device.

they are creating a new copy and distributing that copy.

which makes them liable.

you are right that “ChatGPT” is just a tool.

however, the interesting legal grey area with this is — are ChatGPT model weights an encoded copy of the copyrighted works?

that’s where the conversation about the tool itself being a copyright violation comes in.

photoshop provides no mechanism to recite The Art Of War out of the box. an LLM could be trained to do so (like, it’s a hypothetical example but hopefully you get the point).

chii 7 months ago | parent [-]

> OpenAI make a copy of the original works to create training data.

if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them? What openAI chooses to do with the viewed information is up to them - such as distilling summary statistics, or whatever.

> are ChatGPT model weights an encoded copy of the copyrighted works?

That is indeed the most interesting legal grey area. I personally believe that it is not. The information distilled from those works does not constitute any copyrightable information, as it is not literary, but informational.

It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!

dijksterhuis 7 months ago | parent [-]

> if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them?

whether you can download a copy from your browser doesn’t matter. whether the work is registered as copyrighted does (and following on from that, who is distributing the work - aka allowing you to download the copy - and for what purposes).

from the article (on phone, cba to grab a quote) it’s clear that the Intercept’s works were not registered as copyrighted works with the US Copyright Office.

ergo, those works are not copyrighted and, yes, they essentially are public domain and no remuneration is required …

(they cannot remove DMCA attribution information when distributing copies of the works though, which is what the case is now about.)

but for all the other registered works that OpenAI has downloaded, creating their copy, used in training data, which the model then reproduces as a memorised copy — that is copyright infringement.

like, in case it’s not clear, i’ve been responding to what people are saying about copyright specifically. not this specific case.

> The information distilled from those works do not constitute any copyrightable information, as it is not literary, but informational.

that’s one argument.

my argument would be it is a form of compression/decompression when the model weights result in memorised (read: overfitted) training data being regurgitated verbatim.

put the specific prompt in, you get the decompressed copy out the other end.

it’s like a zip file you download with a new album of music. except, in this case, instead of double-clicking on the file, you have to type in a prompt to get the decompressed audio files (or text, in the LLM case).
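to make the compression analogy concrete, here’s a deliberately overfitted toy sketch. it’s just a next-word lookup table, nothing like a real transformer, and the training sentence and function names are made up for illustration — but it shows the shape of the argument: a “model” that has fully memorised its training text will, given the right prompt prefix, decompress the rest verbatim.

```python
# toy illustration, NOT a real LLM: a next-word lookup table that has
# fully memorised its single training text. given a prompt matching the
# start of the text, it "decompresses" the remainder verbatim.

def train(text):
    """map each word to the word that follows it (pure memorisation)."""
    words = text.split()
    return dict(zip(words, words[1:]))

def generate(model, prompt):
    """keep emitting the memorised next word until the lookup runs out."""
    out = prompt.split()
    while out[-1] in model:
        out.append(model[out[-1]])
    return " ".join(out)

text = "the supreme art of war is to subdue the enemy without fighting"
model = train(text)
print(generate(model, "the supreme"))
# reproduces the training text word for word
```

a real LLM generalises rather than storing a literal lookup table, of course — the copyright question is about the cases where overfitting makes its behaviour collapse towards something like this.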

> It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!

actually, that’s the whole point of courts ruling on this.

the boundaries of what is considered reproduction are at question. it is up to the courts to decide on the red lines (probably blurry gray areas for a while).

if i specifically ask a model to reproduce an exact song… is that different to the model doing it accidentally?

i don’t think so. but a court might see it differently.

as someone who has worked in music copyright, is a musician, and sees the effects of people stealing musicians’ efforts all the time, i hope the little guys come out of this on top.

sadly, they usually don’t.

dijksterhuis 7 months ago | parent | prev [-]

i’ve been avoiding replying to your comment for a bit, and now i realised why.

edit: i am so sorry about the wall of text.

> some are also claiming that the output of an LLM should result in royalties being paid back to the owners of the material used in training.

> So if an AI produces a sitcom script, then the copyright holders of the TV shows it ingested should get paid royalties. In addition to the money paid to copy files around.

what you’re talking about here is the concept of “derivative works” made from other, source works.

this is subtly different to reproduction of a work.

see the last half of this comment for my thoughts on what the interesting thing courts need to work out regarding verbatim reproduction https://news.ycombinator.com/item?id=42282003

in the derivative works case, it’s slightly different.

sampling in music is the best example i’ve got for this.

if i take four popular songs, cut 10 seconds of each, and then join each of the bits together to create a new track — that is a new, derivative work.

but i have not sufficiently modified the source works. they are clearly recognisable. i am just using copyrighted material in a really obvious way. the core of my “new” work is actually just four reproductions of the work of other people.

in that case — that derivative work, under music copyright law, requires the original copyright rights holders to be paid for all usage and copying of their works.

basically, a royalty split gets agreed, or there’s a court case. and then there’s a royalty split anyway (probably some damages too).

in my case, when i make music with samples, i make sure i mangle and process those samples until the source work is no longer recognisable. i’ve legit made it part of my workflow.

it’s no longer the original copyrighted work. it’s something completely new and fully unrecognisable.

the issue with LLMs, not just ChatGpt, is that they will reproduce both verbatim and recognisably similar output to original source works.

the original source copyrighted work is clearly recognisable, even if not an exact verbatim copy.

and that’s what you’ve probably seen folks talking about, at least it sounds like it to me.

> Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.

robin thicke “blurred lines” —

* https://en.m.wikipedia.org/wiki/Pharrell_Williams_v._Bridgep...

* https://en.m.wikipedia.org/wiki/Blurred_Lines (scroll down)

yes, there is already some very limited precedent, at least for a narrow specific case involving sheet music in the US.

the TL;DR IANAL version of the question at hand in the case was “did the defendants write the song with the intention of replicating a hook from the plaintiff’s work”.

the jury decided, yes they did.

this is different to your example in that they specifically set out to replicate that particular musical component of a song.

in your example, you’re talking about someone having “watched” a thing one time and then having to pay royalties to those people as a result.

that’s more akin to “being inspired by”, which i think (IANAL) is protected under US law. it came up in blurred lines, but, well, yeah. https://en.m.wikipedia.org/wiki/Idea%E2%80%93expression_dist...

again, the red line of infringement / not infringement is ultimately up to the courts to rule on.

anyway, this is very different to what openAi/chatGpt is doing.

openAi takes the works. chatgpt edits them according to user requests (feed forward through the model). then the output is distributed to the user. and that output could be considered to be a derivative work (see massive amount of text i wrote above, i’m sorry).

LLMs aren’t sitting there going “i feel like recreating a marvin gaye song”. it takes data, encodes/decodes it, then produces an output. it is a mechanical process, not a creative one. there’s no ideas here. no inspiration or expression.

an LLM is not a human being. it is a tool, which creates outputs that are often strikingly similar to source copyrighted works.

their users might be specifically asking to replicate songs though. in which case, openAi could be facilitating copyright infringement (whether through derivative works or not).

and that’s an interesting legal question by itself. are they facilitating the production of derivative works through the copying of copyrighted source works?

i would say they are. and, in some cases, the derivative works are obviously derived.