thewebguyd 4 days ago
> The Open Source Initiative themselves decided last year to relax their standards for AI models: they don't require the training data to be released.

Poor move IMO. Training data should be required to be released to be considered an open source model. Without it, all I can do is set weights, etc. Without training data I can't truly reproduce the model, inspect the data for biases, audit the model for fairness, or make improvements and redistribute (a core open source ethos). Keeping the training data closed means it's not truly open.
simonw 4 days ago
Their justification for this was that, for many consequential models, releasing the training data just isn't possible.

Obviously the biggest example here is all of that training data which was scraped from the public web (or worse) and cannot be relicensed because the model producers do not have permission to relicense it.

There are other factors too though. A big one is things like health data: if you train a model that can e.g. visually detect cancer cells, you want to be able to release that model without having to release the private health scans that it was trained on.

See their FAQ item: Why do you allow the exclusion of some training data? https://opensource.org/ai/faq#why-do-you-allow-the-exclusion...
tbrownaw 4 days ago
> Poor move IMO. Training data should be required to be released to be considered an open source model.

The actual poor move is trying to fit the term "open source" onto AI models at all, rather than new terms with names that actually match how models are developed.
pxc 4 days ago
This notably marks a schism with the FSF; it's the first time and context in which "open-source" and "free software" have not been synonymous, coextensive terms. I think it greatly diminishes the value of the concept and label of open-source. And it's honestly a bit tragic.