simonw | 4 days ago
Their justification for this was that, for many consequential models, releasing the training data just isn't possible. The biggest example is training data scraped from the public web (or worse), which cannot be relicensed because the model producers don't have permission to relicense it.

There are other factors too. A big one is things like health data: if you train a model that can, e.g., visually detect cancer cells, you want to be able to release that model without having to release the private health scans it was trained on.

See their FAQ item "Why do you allow the exclusion of some training data?": https://opensource.org/ai/faq#why-do-you-allow-the-exclusion...
actionfromafar | 4 days ago
Wouldn't it be great, though, if it were public knowledge exactly what they were trained on and how, even though the data itself cannot be freely copied?