| ▲ | bfrog 5 hours ago |
| I wonder how this will work with AI stuff generating code without any source or attribution. It’s not like the LLMs make this stuff up out of thin air; it comes from source material. |
|
| ▲ | observationist 4 hours ago | parent | next [-] |
| Best case scenario is it nukes the whole concept of software patents and the whole ugly industry of copyright hoarding. The idea that perpetual rent-seeking is a natural extension and intended outcome of the legal concepts of copyrights and patents is bizarre. |
| |
| ▲ | LeFantome 2 hours ago | parent [-] | | I cannot imagine it somehow impacts patents. Copyright and patents are completely independent concepts. The “perpetual” part is the issue, but “rent seeking” is the entire reason that copyright and patents exist to begin with. |
|
|
| ▲ | alienbaby 4 hours ago | parent | prev | next [-] |
| LLMs do not verbatim disgorge chunks of the code they were trained on. |
| |
| ▲ | perryprog 3 hours ago | parent | next [-] | | I think it's probably less frequent nowadays, but it very much does happen. This still-active lawsuit[0] was filed in response to LLMs generating verbatim chunks of code that they were trained on.[1] [0] https://githubcopilotlitigation.com
[1] https://www.theverge.com/2022/11/8/23446821/microsoft-openai... | |
| ▲ | AshamedCaptain 2 hours ago | parent | prev | next [-] | | You can still very trivially get entire chunks of code from Copilot including even literal author names (simply by prodding with a doxygen tag). | |
| ▲ | neilv 2 hours ago | parent | prev | next [-] | | They do, and, early on, Microsoft (and perhaps others) put in some checks to try to hide that. | |
| ▲ | idle_zealot 3 hours ago | parent | prev | next [-] | | Surely they do sometimes? | | |
| ▲ | kelseyfrog 3 hours ago | parent [-] | | A 26-sided die reproduces chunks of source code. What's the dividing line? | | |
| ▲ | AshamedCaptain 2 hours ago | parent | next [-] | | This is a multi-terabyte-sized die that is not at all random AND has most definitely copied the source code in question to begin with. | | |
| ▲ | kelseyfrog 2 hours ago | parent [-] | | The die is certainly not multi-terabyte. A more realistic number would be 32k-sided to 50k-sided if we want to go with a pretty average token vocabulary size. Really, it comes down to encoding. Arbitrarily short UTF-8-encoded strings can be generated using a coin flip. | | |
| ▲ | Dylan16807 an hour ago | parent [-] | | The number of sides has nothing to do with the data within. It's not random and sometimes it repeats things in an obviously non-chance way. | | |
| ▲ | kelseyfrog 22 minutes ago | parent [-] | | Of course, it's random and by chance - tokens are literally sampled from a predicted probability distribution. If you mean chance=uniform probability you have to articulate that. It's trivially true that arbitrarily short reconstructions can be reproduced by virtually any random process and reconstruction length scales with the similarity in output distribution to that of the target. This really shouldn't be controversial. My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line? | | |
| ▲ | Dylan16807 15 minutes ago | parent [-] | | > Of course, it's random and by chance - tokens are literally sampled from a predicted probability distribution. Picking randomly out of a non-random distribution doesn't give you a random result. And you don't have to use randomness to pick tokens. > If you mean chance=uniform probability you have to articulate that. Don't be a pain. This isn't about uniform distribution versus other generic distribution. This is about the very elaborate calculations that exist on a per-token basis specifically to make the next token plausible and exclude the vast majority of tokens. > My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line? Any reasonable line has examples that cross it from many models. Very long segments that can be reproduced. Because many models were trained in a way that overfits certain pieces of code and basically causes them to be memorized. | | |
| ▲ | kelseyfrog 7 minutes ago | parent [-] | | > Very long segments that can be reproduced Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. This is trivially true. I could write "//" on a coin and half the time it's going to land "//". Let's agree that's a lower bound. I don't even disagree that there is an upper bound. Surely reproducing a repo in its entirety is a match. So there must exist a line between the two that divides too short and too long. Again, on what basis do you draw a line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely "reasonableness"? |
|
|
|
|
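The disagreement above — whether sampling from a learned distribution counts as "chance" — can be made concrete with a toy calculation. The per-token probability and vocabulary size below are hypothetical illustrations, not measurements from any real model:

```python
# Toy comparison (hypothetical numbers): probability of reproducing a short
# memorized token sequence under an overfit model vs. a uniform "die".
seq_len = 3                          # a 3-token snippet from the training data
vocab = 50_000                       # ~typical vocabulary size, per the thread

peaked_p = 0.95 ** seq_len           # overfit model: assume 0.95 per token
uniform_p = (1 / vocab) ** seq_len   # uniform 50k-sided die

print(f"peaked:  {peaked_p:.3f}")              # 0.857
print(f"uniform: {uniform_p:.1e}")             # 8.0e-15
print(f"ratio:   {peaked_p / uniform_p:.2e}")  # ~1e14
```

Both processes are "random" in the sampling sense, but the peaked model reproduces the memorized snippet roughly fourteen orders of magnitude more often than a fair die — and the gap widens exponentially with sequence length, which is where any line-drawing argument would have to operate.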
| |
| ▲ | afiori 3 hours ago | parent | prev [-] | | IIRC at one point it was 6 lines of code |
|
| |
| ▲ | bobsmooth an hour ago | parent | prev [-] | | ChatGPT has given me code with comments so specific that I found the original six-year-old GitHub repo. |
|
|
| ▲ | ranger_danger 3 hours ago | parent | prev | next [-] |
| Everything humans make up also comes from source material. The real (legal) question, in either case, is how much is actually copied and how obvious it is. |
| |
| ▲ | beardbound 2 hours ago | parent [-] | | I mostly agree with you, but if a human straight up copies work under copyright, they’re violating the law. Seems like an LLM should be held to the same standard, unless LLMs should be even less beholden to the law than people. It’s also incredibly hard to tell if an LLM copied something, since you can’t ask it in court and it probably can’t even tell you if it did. | | |
| ▲ | ranger_danger 2 hours ago | parent [-] | | From what I have seen, the (US) courts seem to make a distinction between 100% machine-automated output with no manual prompting at all, versus a human giving it specific instructions on what to generate. (And yes I realize everything a computer does requires prior instruction of some kind.) But the issue with copyright I think comes from the distribution of a (potentially derivative or transformative in the legal sense) work, which I would say is typically done manually by a human to some extent, so I think they would be on the hook for any potential violations in that case, possibly even if they cannot actually produce sources themselves since it was LLM-generated. But the legal test always seems to come back to what I said before, simply "how much was copied, and how obvious is it?" which is going to be up to the subjective interpretation of each judge of every case. |
|
|
|
| ▲ | userbinator 4 hours ago | parent | prev [-] |
| Everything is a derivative work. |