sigseg1v 2 hours ago

What does derivative mean here? Because IMO it means that the existing work was used as input. So if you used an LLM and it was trained on the existing work, that's a derivative work. If you rot13-encode something as input, so you can't personally read it, and then a device applies rot13 to it again and outputs the result, that's a derivative work.
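To make the rot13 point concrete, here is a minimal sketch. It assumes Python's standard `codecs` module and uses a made-up placeholder string in place of any real source code. Applying rot13 twice returns the original text byte for byte, so the unreadable intermediate form changes nothing about where the output came from.

```python
import codecs

# Placeholder text standing in for "the existing work" (hypothetical, not real chardet code).
original = "def detect(data): ..."

# First pass: the text becomes unreadable to a person skimming it.
encoded = codecs.encode(original, "rot_13")

# Second pass: rot13 is its own inverse, so the original comes back unchanged.
decoded = codecs.encode(encoded, "rot_13")

print(encoded != original)  # the intermediate form differs from the input
print(decoded == original)  # the round trip reproduces the input exactly
```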

spullara 2 hours ago | parent | next [-]

In order for it to be creatively derivative you would need to copy the structure, logic, organization, and sequence of operations, not just reimplement the functionality. It is pretty clear in this case that wasn't done.

ghostpepper 2 hours ago | parent | prev | next [-]

As a cynical person I assume all the frontier LLMs were trained on datasets that include every open source project, but as a thought experiment, if an LLM was trained on a dataset that included every open source project _except_ chardet, do you think said LLM would still be able to easily implement something very similar?

spullara 2 hours ago | parent [-]

There is no doubt in my mind that it could still do it.

nicole_express 2 hours ago | parent | prev | next [-]

Of course, the problem with this interpretation is that all modern LLMs are derivatives of huge amounts of text under completely different licenses, including "All rights reserved", and therefore cannot be used for any purpose.

I'm not sure how you square the circle of "it's alright to use the LLM to write code, unless the code is a rewrite of an open source project to change its license".

satvikpendem 2 hours ago | parent | prev | next [-]

> Because IMO it means that the existing work was used as input

That's your opinion (since you said "IMO"), not the actual legal definition.

bmcahren 2 hours ago | parent | prev | next [-]

LLMs neither encode nor encrypt their training data. The fact they can recite training data is a defect, not a default. You can understand this more simply by treating the model as the output of a fantasy compression algorithm that is 50% better than SOTA and working backward from the model size. You'll find you'd still be missing 80-90% of the training data even if it were as much of a stochastic parrot as you may be implying. The outputs of AI are not derivative just because the model saw training data including the original library.
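A rough sketch of that back-of-the-envelope calculation follows. Every number in it is an illustrative placeholder, not a measurement of any real model or dataset, and reading "50% better than SOTA" as a compression ratio 1.5 times the best known one is an assumption.

```python
def fraction_unrecoverable(corpus_bytes: float, model_bytes: float,
                           sota_ratio: float, improvement: float = 0.5) -> float:
    """Fraction of the corpus that could not fit in the model, even if the
    model were nothing but a compressed archive of its training data."""
    # Assumption: "50% better than SOTA" means 1.5x the best known ratio.
    fantasy_ratio = sota_ratio * (1 + improvement)
    # Most training data the weights could hold under that fantasy ratio.
    storable_bytes = model_bytes * fantasy_ratio
    return max(0.0, 1.0 - storable_bytes / corpus_bytes)


# Hypothetical inputs chosen only to show the arithmetic:
# a 10 TB text corpus, 140 GB of weights, an 8:1 best-known compression ratio.
missing = fraction_unrecoverable(corpus_bytes=10e12, model_bytes=140e9, sota_ratio=8.0)
print(f"{missing:.1%} of the corpus could not be stored")
```

Plugging in different corpus or model sizes changes the result, so the figure is only as good as the inputs you believe.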

Then onto prompting: 'He fed only the API and (his) test suite to Claude'

This is Google v Oracle all over again - are APIs copyrightable?

satvikpendem 2 hours ago | parent [-]

> This is Google v Oracle all over again - are APIs copyrightable?

Yes, this is the best way to ask the question. If I take a public-facing API and reimplement everything, whether it's by human or machine, it should be sufficient. After all, that's what Google did, and it's not like their engineers never read a single line of the Java source code. Even in "clean room" implementations, a human might still have remembered or recalled a previous implementation of some function they had encountered before.

wizzwizz4 2 hours ago | parent | prev [-]

See also: https://monolith.sourceforge.net/, which seeks to ask the question:

> But how far away from direct and explicit representations do we have to go before copyright no longer applies?