RcouF1uZ4gsC 5 hours ago

I think you could have an LLM produce a written English detailed description of the complete logic of the program and tests.

Then use another LLM to produce code from that spec.

This would be similar to the cleanroom technique.
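The two-stage workflow could be sketched roughly as follows. This is a hypothetical illustration, not a real pipeline: `call_llm` is a stand-in for whatever chat-completion API one would actually use, stubbed here with canned responses so the example runs offline.

```python
# Sketch of the two-LLM "cleanroom" idea: one model writes an English-only
# spec, a second model (which never sees the original code) implements it.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    # Stubbed responses stand in for model output.
    if prompt.startswith("Describe"):
        return "Spec: a function add(a, b) that returns the sum of its two arguments."
    return "def add(a, b):\n    return a + b\n"

def cleanroom_rewrite(original_code: str) -> str:
    # Stage 1: produce a plain-English description of the logic (no code).
    spec = call_llm(
        "Describe, in English prose only, the complete logic of this program:\n"
        + original_code
    )
    # Stage 2: a second model sees only the spec, never the original source.
    return call_llm("Implement this specification:\n" + spec)

new_code = cleanroom_rewrite("def add(x, y): return x + y")
```

Whether the second stage's output is legally "clean" of course depends on what the second model was trained on, which the replies below get into.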

simiones 4 hours ago | parent | next [-]

Producing a copy of a copyrighted work through a purely mechanical process is a clear violation of copyright. LLMs are absolutely not different from a copier machine in the eyes of the law.

Original works can only be produced by a human being, by definition in copyright law. Any artifact produced by an animal, a mechanical process, a machine, a natural phenomenon, etc. is either a derivative work, if it started from an original copyrighted work, or a public-domain artifact not covered by copyright law if it didn't.

For example, an image created on a rock struck by lightning is not a copyrighted work. Similarly, an image generated by a diffusion model from a randomly generated sentence is not a copyrightable work. However, if you feed a novel as a prompt to an LLM and ask for a summary, the resulting summary is a derivative work of said novel, and it falls under the copyright of the novel's owner - you are not allowed to distribute copies of the summary the LLM generated for you.

Whether the output of an LLM, or the LLM weights themselves, might be considered derivative works of that LLM's training set is a completely different discussion, and one that has not yet been settled in court.

robinsonb5 5 hours ago | parent | prev | next [-]

Perhaps - but an argument might still be made that the result is a derivative work of the original, given that it's produced by feeding the original work through automated tooling.

But either way, deleting the original version from the repo and replacing it with the new version - as opposed to, say, archiving the old version and starting a new repo with the new version - would still be a dick move.

robin_reala 5 hours ago | parent | prev | next [-]

That assumes the second LLM hadn't been trained on the existing codebase, which in this case we can't know, but can reasonably assume it was.

knollimar 5 hours ago | parent | prev [-]

Does the second LLM have the codebase in its training?

9864247888754 5 hours ago | parent [-]

One could use Comma, which has only been trained on public domain texts:

https://arxiv.org/pdf/2506.05209