AshamedCaptain 2 hours ago

This is a multi-terabyte-sized die that is not at all random AND has most definitely copied the source code in question to begin with.

kelseyfrog 2 hours ago | parent [-]

The die is certainly not multi-terabyte. A more realistic number would be 32k-sided to 50k-sided if we want to go with a pretty average token vocabulary size.

Really, it comes down to encoding. Arbitrarily short UTF-8-encoded strings can be generated with coin flips.
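
A minimal sketch of that coin-flip point (fair flips, made-up helper; the odds of hitting any particular string fall off exponentially with its length):

    import random

    def coin_flip_string(n_bytes):
        # 8 fair coin flips per byte; decode leniently so it always prints
        bits = [random.randint(0, 1) for _ in range(8 * n_bytes)]
        byte_vals = [int("".join(map(str, bits[i:i+8])), 2)
                     for i in range(0, len(bits), 8)]
        return bytes(byte_vals).decode("utf-8", errors="replace")

    # P(exact match) for a specific n-byte target is (1/256)**n,
    # so "//" comes up about 1 time in 65,536 two-byte draws.
    print(coin_flip_string(2))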

Dylan16807 an hour ago | parent [-]

The number of sides has nothing to do with the data within. It's not random and sometimes it repeats things in an obviously non-chance way.

kelseyfrog 17 minutes ago | parent [-]

Of course it's random and by chance: tokens are literally sampled from a predicted probability distribution. If you mean chance = uniform probability, you have to articulate that.

It's trivially true that arbitrarily short reconstructions can be produced by virtually any random process, and achievable reconstruction length scales with how closely the output distribution matches that of the target. This really shouldn't be controversial.

My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line?
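
For concreteness, a minimal sketch of what "sampled from a predicted probability distribution" means; the logits below are made up, not from any real model:

    import math, random

    def sample_token(logits, temperature=1.0):
        # softmax over the (hypothetical) logits, then draw one token index
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(probs)), weights=probs, k=1)[0]

    # A real model scores every token in its ~32k-50k vocabulary at each step.
    fake_logits = [2.0, 0.1, -1.0, 5.5]  # assumed values for illustration
    print(sample_token(fake_logits))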

Dylan16807 10 minutes ago | parent [-]

> Of course, it's random and by chance - tokens are literally sampled from a predicted probability distribution.

Picking randomly out of a non-random distribution doesn't give you a random result.

And you don't have to use randomness to pick tokens.

> If you mean chance=uniform probability you have to articulate that.

Don't be a pain. This isn't about a uniform distribution versus some other generic distribution. This is about the very elaborate per-token calculations that exist specifically to make the next token plausible and to exclude the vast majority of tokens.
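
To make that concrete, a small sketch with assumed numbers (not from any real model): greedy decoding uses no randomness at all, and a peaked distribution puts essentially all of its mass on one token:

    import math

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]

    fake_logits = [12.0, 3.0, 1.0, -2.0, -5.0]  # assumed, heavily peaked
    probs = softmax(fake_logits)
    greedy = probs.index(max(probs))  # deterministic: always the same token
    print(greedy, max(probs))         # top token carries ~99.99% of the mass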

> My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line?

Any reasonable line has examples from many models that cross it: very long segments that can be reproduced, because many models were trained in a way that overfits certain pieces of code and effectively causes them to be memorized.

kelseyfrog 2 minutes ago | parent [-]

> Very long segments that can be reproduced

Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. This is trivial.

Again, on what basis do you draw a line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely by "reasonableness"?
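
For what it's worth, a sketch of one way matching length could be measured; the token lists here are placeholders, and the threshold question stays open:

    # Longest run of consecutive matching tokens between an output
    # and a source file (both sequences below are hypothetical).
    def longest_common_run(output_tokens, source_tokens):
        best = 0
        for i in range(len(output_tokens)):
            for j in range(len(source_tokens)):
                k = 0
                while (i + k < len(output_tokens) and j + k < len(source_tokens)
                       and output_tokens[i + k] == source_tokens[j + k]):
                    k += 1
                best = max(best, k)
        return best

    # A 1-token overlap like "//" is trivial; where the cutoff goes is the question.
    print(longest_common_run(["//", "foo", "(", ")"], ["//", "bar"]))  # -> 1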