kelseyfrog 2 hours ago
> Very long segments that can be reproduced

Right, and very short segments can also be reproduced. Let's say that "//" is an arbitrarily short segment that matches some source code. That's trivially true; I could write "//" on a coin and half the time it's going to land on "//". Let's agree that's a lower bound.

I don't even disagree that there is an upper bound. Surely reproducing a repo in its entirety is a match. So there must exist a line between the two that divides too short from too long.

Again, on what basis do you draw the line between a 1-token reproduction and a 1,000-token reproduction? 5, 10, 20, 50? How is it justified? Purely by "reasonableness"?
Dylan16807 an hour ago
Why do you want me to pick a number so badly?

There are very, very long examples that are clearly memorization. Like, if a model was trained on all the code in the world except that specific example, the chance of it producing that snippet would be less than a billionth of a billionth of a percent. But that snippet got fed in so many times that it gets treated like a standard idiom and memorized in full.

Is that a clear enough threshold for you? I don't know where the exact line is, but I know it's somewhere inside this big ballpark, and there are examples that go past the entire ballpark. I don't care where specifically the bound is.
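A rough back-of-the-envelope sketch (my numbers, not the commenter's): assume a generous 0.9 probability of independently emitting each "correct" next token without having memorized the snippet. The chance of an exact verbatim reproduction then decays exponentially with snippet length, which is why a 1,000-token match is effectively impossible by chance.

    # Illustrative only: the per-token probability is an assumption, not a measured value.
    p_token = 0.9
    for n in (1, 10, 100, 1000):
        print(n, p_token ** n)
    # 1     0.9
    # 10    ~0.35
    # 100   ~2.7e-05
    # 1000  ~1.7e-46  (far below "a billionth of a billionth of a percent")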