Dylan16807 2 hours ago

> Of course, it's random and by chance - tokens are literally sampled from a predicted probability distribution.

Picking randomly out of a non-random distribution doesn't give you a random result.

And you don't have to use randomness to pick tokens.
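To make those two points concrete, here is a minimal sketch. The four-token vocabulary and the logits are made up for illustration and do not come from any real model. It shows deterministic (greedy) selection alongside sampling from the same peaked distribution.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(probs):
    """Deterministic choice: always the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
    """Random choice, weighted by the model's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and logits, chosen only to illustrate a peaked distribution.
vocab = ["//", "return", "import", "banana"]
logits = [1.0, 6.0, 2.0, -4.0]
probs = softmax(logits)

rng = random.Random(0)
samples = [vocab[pick_sampled(probs, rng)] for _ in range(1000)]
greedy_token = vocab[pick_greedy(probs)]
```

Greedy selection involves no randomness at all, and even the sampled picks land overwhelmingly on the token the distribution favors.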

> If you mean chance=uniform probability you have to articulate that.

Don't be a pain. This isn't about uniform distribution versus other generic distribution. This is about the very elaborate calculations that exist on a per-token basis specifically to make the next token plausible and exclude the vast majority of tokens.

> My point is that matching sequence length and distributional similarity are both quantifiable. Where do you draw the line?

Any reasonable line has examples that cross it from many models: very long segments that can be reproduced, because many models were trained in a way that overfits certain pieces of code and basically causes them to be memorized.

kelseyfrog 2 hours ago | parent [-]

> Very long segments that can be reproduced

Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. This is trivially true. I could write "//" on a coin and half the time it's going to land "//". Let's agree that's a lower bound.

I don't even disagree that there is an upper bound. Surely reproducing a repo in its entirety is a match.

So there must exist a line between the two that divides too short and too long.

Again, on what basis do you draw a line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely "reasonableness"?

Dylan16807 an hour ago | parent [-]

Why do you want me to pick a number so badly?

There are very very long examples that are clearly memorization.

Like, if a model was trained on all the code in the world except that specific example, the chance of it producing that snippet is less than a billionth of a billionth of a percent. But that snippet got fed in so many times it gets treated like a standard idiom and memorized in full.
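As a rough back-of-the-envelope sketch of that figure: assume, as a simplification, that each token of the snippet is matched independently with the same fixed probability. Real models condition on context, so this is not how they actually behave. "A billionth of a billionth of a percent" is 1e-20 as a probability, and the helper below (a hypothetical name) counts how many tokens it takes for a chance match to fall below that.

```python
import math

def tokens_until_below(per_token_prob, threshold):
    """Smallest n such that per_token_prob ** n < threshold.

    Assumes independent tokens with a constant match probability,
    which is a simplification made only for illustration.
    """
    # p**n < t  <=>  n > log(t) / log(p), because log(p) is negative for 0 < p < 1.
    return math.floor(math.log(threshold) / math.log(per_token_prob)) + 1

# A billionth of a billionth of a percent, expressed as a probability.
THRESHOLD = 1e-9 * 1e-9 * 1e-2

coin_flip_tokens = tokens_until_below(0.5, THRESHOLD)    # each token a 50/50 guess
very_likely_tokens = tokens_until_below(0.9, THRESHOLD)  # each token 90% predictable
```

Under these assumptions, a few hundred tokens is already enough to put a chance match below that threshold, even when every individual token is 90% predictable.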

Is that a clear enough threshold for you?

I don't know where the exact line is, but I know it's somewhere inside this big ballpark, and there are examples that go past the entire ballpark.

I don't care where specifically the bound is.

kelseyfrog 40 minutes ago | parent [-]

Ok, 1 it is then.

Dylan16807 31 minutes ago | parent [-]

That is not good faith, my dude.