| ▲ | briandw 3 hours ago | |
Obviously satire, but it will clearly be what happens in the future (predicting here, I'm not endorsing this practice). We can scratch train a new LLM on code generated from "contaminated" LLMs. We can then audit all the training data used and demonstrate that the original source wasn't in the training data. Therefore the cleanroom implementation holds. Current LLM training is relying less and less on human generated code. Just look at the open source models from China. They rely heavily on distilling from other models. One additional point. Exposure to the original source isn't enough to show infringement. Linus looked at UNIX source before writing linux. | ||