kshri24 3 hours ago
I don't think you can classify "public data in" as public domain. Public data could also include commercial licenses which forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is OSL. That's the core issue here. All models are trained on ALL source code that is publicly available irrespective of how it was licensed. It is illegal but every company training LLMs is doing it anyways.
thedevilslawyer 3 hours ago | parent
Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is fair game. LLM ingestion comes under fair use so no worries. If someone can get their hands on it, nothing in law stops them from ingesting it for training. We can debate if this law is moral. Like the GP, I too agree that public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.