e12e 3 hours ago
Seems to gloss over other kinds of contamination beyond GPL code: code from pirated textbooks, the problem of the entire language model being trained on copyrighted data, and the possibility that the training data contains various copyrighted code.
embedding-shape 3 hours ago
> Code from pirated textbooks

Anthropic "solved" this by intermingling the texts extracted from pirated books (illegal) with texts extracted from the physical books they bought and destroyed (legal), so no one can clearly say whether the copyrighted material it spits out came from a legal source or not. Everyone rejoiced.