kelseyfrog 2 hours ago
> Very long segments that can be reproduced

Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. That's trivially true; I could write "//" on a coin and half the time it's going to land on "//". Let's agree that's a lower bound.

I don't even disagree that there is an upper bound. Surely reproducing a repo in its entirety is a match. So there must exist a line between the two that divides too short from too long.

Again, on what basis do you draw the line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely by "reasonableness"?
Dylan16807 an hour ago
Why do you want me to pick a number so badly?

There are very, very long examples that are clearly memorization. Like, if a model was trained on all the code in the world except that specific example, the chance of it producing that snippet would be less than a billionth of a billionth of a percent. But that snippet got fed in so many times that it gets treated like a standard idiom and memorized in full.

Is that a clear enough threshold for you? I don't know where the exact line is, but I know it's somewhere inside this big ballpark, and there are examples that go past the entire ballpark. I don't care where specifically the bound is.
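A rough back-of-the-envelope sketch (my numbers, not the commenter's): assume a generous 0.9 probability of independently emitting each "correct" next token without having memorized the snippet. The chance of an exact verbatim reproduction then decays exponentially with snippet length, which is why a 1,000-token match is effectively impossible by chance.

    # Illustrative only: the per-token probability is an assumption, not a measured value.
    p_token = 0.9
    for n in (1, 10, 100, 1000):
        print(n, p_token ** n)
    # 1     0.9
    # 10    ~0.35
    # 100   ~2.7e-05
    # 1000  ~1.7e-46  (far below "a billionth of a billionth of a percent")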