bmcahren 2 hours ago
LLMs do not encode or encrypt their training data. The fact that they can recite some of it is a defect, not a default. You can see this by treating the model as the output of a fantasy compression algorithm 50% better than SOTA: even then, you'd find you were still missing 80-90% of the training data, even if the model were as much of a stochastic parrot as you may be implying. The outputs of AI are not derivative works just because the model saw training data that included the original library.

Then onto prompting: "He fed only the API and (his) test suite to Claude." This is Google v Oracle all over again - are APIs copyrightable?
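A quick sanity check of that compression argument. All the figures below (corpus size, tokens-per-byte, compression ratio, model size) are illustrative assumptions I've picked for the sketch, not measurements from any particular model:

```python
# Back-of-envelope: could a model's weights store its training corpus,
# even under an implausibly good compression scheme?
# All numbers are assumed round figures for illustration.

training_tokens = 15e12            # assumed corpus size, ~15T tokens
bytes_per_token = 4                # rough average for English text
corpus_bytes = training_tokens * bytes_per_token   # ~60 TB of raw text

sota_ratio = 8.0                   # strong text compressors reach very roughly 8:1
fantasy_ratio = sota_ratio * 1.5   # a fantasy codec 50% better than SOTA

compressed_bytes = corpus_bytes / fantasy_ratio    # ~5 TB compressed

model_params = 70e9                # assumed 70B-parameter model
bytes_per_param = 2                # fp16/bf16 weights
model_bytes = model_params * bytes_per_param       # ~140 GB of weights

fraction_storable = model_bytes / compressed_bytes
print(f"weights could hold at most {fraction_storable:.1%} of the compressed corpus")
```

Under these assumptions the weights could hold only a few percent of even the fantasy-compressed corpus, which is consistent with (and more extreme than) the "missing 80-90%" figure above.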
satvikpendem 2 hours ago
> This is Google v Oracle all over again - are APIs copyrightable?

Yes, that is the best way to frame the question. If I take a public-facing API and reimplement everything behind it, whether by human or machine, that should be sufficient. After all, that's what Google did, and it's not as if their engineers had never read a single line of the Java source code. Even in "clean room" implementations, a human might still remember a previous implementation of some function they had encountered before.