luke5441 14 hours ago
OP's idea is about having a new GPL-like license with a "may not be used for LLM training" clause. That the LLM itself is not allowed to produce copyrighted work (e.g. plain copies of works, or output that is too structurally similar) without a license for that work is probably already current law. They are working around this via content filters. They probably also have checks during/after training that the model does not reproduce work that is too similar. There are lawsuits pending about this, if I remember correctly, e.g. the one with the New York Times.
martin-t 14 hours ago
The issue is that everyone is focusing on verbatim (or "too similar") reproduction.

LLMs themselves are compressed models of the training data. The trick is that the compression is highly lossy: it detects higher-order patterns instead of focusing on the first-order input tokens (or bytes). If you look at how, for example, any of the Lempel-Ziv algorithms work, they also contain patterns from the input and they also predict the next token (usually a byte in their case), except they do it with 100% probability because they are lossless.

So copyright should absolutely apply to the models themselves, and if trained on AGPL code, the models have to follow the AGPL license and I have the right to see their "source" just by being their user.

And if you decompress a file from a copyrighted archive, the file is obviously copyrighted. Even if you decompress only a part. What LLMs do is another trick: by being lossy, they decompress probabilistically based on all the training inputs, so without seeing the internals, nobody can prove how much their particular work contributed to a particular output. But it is all mechanical transformation of input data, just like synonym replacement, just more sophisticated, and the same rules regarding plagiarism and copyright infringement should apply.

---

Back to what you said: the LLM companies use fancy language like "artificial intelligence" to distract from this so they can then use more fancy language to claim copyright does not apply. And in that case, no license would help because any such license fundamentally depends on copyright law, which, as they claim, does not apply.

That's the issue with LLMs: if they get their way, there's no way to opt out. If there was, AGPL would already be sufficient.
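To make the Lempel-Ziv comparison concrete, here is a minimal sketch of LZ78, one member of that family, picked only for illustration. The function names are made up for this example. It shows just two things: the dictionary the compressor builds consists entirely of patterns copied from the input, and decoding reproduces the input exactly. It says nothing about how any actual LLM works internally.

```python
def lz78_compress(data: str) -> list[tuple[int, str]]:
    """Encode data as (index of a known phrase, next character) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            # Pattern already seen: keep extending the match.
            phrase = candidate
        else:
            # New pattern: emit (longest known prefix, new character)
            # and remember the new pattern for later reuse.
            output.append((dictionary[phrase], ch))
            dictionary[candidate] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase. Every stored phrase minus its
        # last character is also stored, so this lookup always succeeds.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output


def lz78_phrases(pairs: list[tuple[int, str]]) -> list[str]:
    """Rebuild the phrase emitted by each pair, in order."""
    known = [""]
    for index, ch in pairs:
        known.append(known[index] + ch)
    return known[1:]


def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Lossless decoding: concatenating the phrases gives back the input."""
    return "".join(lz78_phrases(pairs))


if __name__ == "__main__":
    text = "abababababab"
    pairs = lz78_compress(text)
    print(pairs)
    print(lz78_phrases(pairs))             # every phrase is a substring of text
    print(lz78_decompress(pairs) == text)  # exact round trip
```

The compressed form is nothing but references into patterns taken from the input, which is the sense in which a decompressed file, or part of one, is still the original work.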