simonw 4 days ago

The Open Source Initiative themselves decided last year to relax their standards for AI models: they don't require the training data to be released. https://opensource.org/ai

They do continue to require the core freedoms, most importantly "Use the system for any purpose and without having to ask for permission". That's why a lot of the custom licenses (Llama etc) don't fit the OSI definition.

thewebguyd 4 days ago | parent | next [-]

> The Open Source Initiative themselves decided last year to relax their standards for AI models: they don't require the training data to be released.

Poor move IMO. Training data should be required to be released for a model to be considered open source. Without it, all I can do is set weights, etc. Without training data I can't truly reproduce the model, inspect the data for biases, audit the model for fairness, or make improvements & redistribute (a core open source ethos).

Keeping the training data closed means it's not truly open.

simonw 4 days ago | parent | next [-]

Their justification for this was that, for many consequential models, releasing the training data just isn't possible.

Obviously the biggest example here is all of that training data which was scraped from the public web (or worse) and cannot be relicensed because the model producers do not have permission to relicense it.

There are other factors too though. A big one is things like health data - if you train a model that can e.g. visually detect cancer cells you want to be able to release that model without having to release the private health scans that it was trained on.

See their FAQ item: Why do you allow the exclusion of some training data? https://opensource.org/ai/faq#why-do-you-allow-the-exclusion...

actionfromafar 4 days ago | parent [-]

Wouldn't it be great though if it were public knowledge exactly what they were trained on and how, even though the data itself cannot be freely copied?

tbrownaw 4 days ago | parent | prev | next [-]

> Poor move IMO. Training data should be required to be released to be considered an open source model.

The actual poor move is trying to fit the term "open source" onto AI models at all, rather than new terms with names that actually match how models are developed.

pxc 4 days ago | parent | prev [-]

This notably marks a schism with the FSF; it's the first time and context in which "open-source" and "free software" have not been synonymous, coextensive terms.

I think it greatly diminishes the value of the concept and label of open-source. And it's honestly a bit tragic.

amelius 4 days ago | parent | prev [-]

I don't agree with that definition. For a given model I want to know what I can/cannot expect from it. To have a better understanding of that, I need to know what it was trained on.

For a (somewhat extreme) example, what if I use the model to write children's stories, and suddenly it regurgitates Mein Kampf? That would certainly ruin the day.

senko 4 days ago | parent | next [-]

Are you going to examine a few petabytes of data for each model you want to run, to check if a random paragraph from Mein Kampf is in there? How?

We need better tools to examine the weights (what gets activated, to what extent, for which topics, for example). Getting the full training corpus, while nice, cannot be our only choice.

amelius 4 days ago | parent [-]

> Are you going to examine a few petabytes of data for each model (...) How?

I can think of a few ways. Perhaps I'd use an LLM to find objectionable content. But anyway, it's the same argument you could make against e.g. the Linux kernel. Are you going to read every line of code to see if it is secure? Maybe, or maybe not, but that is not the point.
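
For the narrow case of checking whether a known text shows up in a released corpus, an LLM may not even be needed. A minimal sketch, assuming the corpus can be streamed as (id, text) pairs; the function names, the 8-word shingle size, and the 0.2 threshold are illustrative choices of mine, not anything a model vendor ships:

```python
import hashlib
import re
from typing import Iterable, Iterator, Set, Tuple


def shingles(text: str, n: int = 8) -> Iterator[str]:
    """Yield every run of n consecutive lowercased words in text."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def fingerprint(shingle: str) -> bytes:
    """Hash a shingle down to 8 bytes so the index stays small."""
    return hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()


def build_index(reference_text: str, n: int = 8) -> Set[bytes]:
    """Fingerprint every shingle of the text we are looking for."""
    return {fingerprint(s) for s in shingles(reference_text, n)}


def overlap_ratio(document: str, index: Set[bytes], n: int = 8) -> float:
    """Fraction of the document's shingles that also occur in the reference."""
    hashes = [fingerprint(s) for s in shingles(document, n)]
    if not hashes:
        return 0.0  # document shorter than one shingle
    return sum(h in index for h in hashes) / len(hashes)


def flag_documents(
    docs: Iterable[Tuple[str, str]],
    index: Set[bytes],
    n: int = 8,
    threshold: float = 0.2,  # assumed cut-off, tune per corpus
) -> Iterator[str]:
    """Stream (doc_id, text) pairs and yield the ids that overlap heavily."""
    for doc_id, text in docs:
        if overlap_ratio(text, index, n) >= threshold:
            yield doc_id
```

This only catches near-verbatim copies in one pass over the data; paraphrased or translated material would need something heavier, such as a classifier.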

The point is that, right now, a model is a black box. It might as well be a Trojan horse.

Ancapistani 4 days ago | parent | next [-]

Let's pretend for a moment that the entire training corpus for Deepseek-R1 were released.

How would you download it?

Where would you store it?

lrvick 3 days ago | parent [-]

I mean, many people I know have 100 TB+ of storage at home now. A large enough team of dedicated community members cooperating and sharing compute resources online should be able to reproduce any model.

senko 4 days ago | parent | prev [-]

You would use an LLM to process a few petabytes of data to find a needle in the haystack?

Cheaper to train your own.

echelon 4 days ago | parent | prev [-]

Too bad. The OSI owns "open source".

Big tech has been abusing open source to cheaply capture most of the internet and e-commerce anyway, so perhaps it's time we walked away from the term altogether.

The OSI has abdicated the future of open machine learning. And that's fine. We don't need them.

"Free software" is still a thing, and it is defined by a very specific and narrow set of criteria. [1, 2]

There's also "Fair software" [3], which walks the line between CC BY-NC-SA and shareware, but also sticks it to big tech by preventing Redis/Elasticsearch capture by the hyperscalers. There's an open game engine [4] that has a pretty nice "Apache + NC" type license.

---

Back on the main topic of "open machine learning": since the OSI fucked up, I came up with a ten-point scale here [5] defining open AI models. It's just a draft, but if other people agree with the idea, I'll publish a website about it (so I'd appreciate your feedback!)

There are ten measures by which a model can/should be open:

1. The model code (PyTorch, whatever)

2. The pre-training code

3. The fine-tuning code (which might be very different from the pre-training code)

4. The inference code

5. The raw training data (pre-training + fine-tuning)

6. The processed training data (which might vary across various stages of pre-training and fine-tuning: different sizes, features, batches, etc.)

7. The resultant weights blob(s)

8. The inference inputs and outputs (which also need a license; see also usage limits like O-RAIL)

9. The research paper(s) (hopefully the model is also described and characterized in the literature!)

10. The patents (or lack thereof)

A good open model will have nearly all of these made available. A fake "open" model might only give you two of ten.
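
As a rough illustration, the ten measures could be tallied as a simple checklist. This is a minimal sketch: the class name, the field names, and the one-point-per-measure scoring are assumptions made for the example and are not part of the linked draft.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class OpennessScorecard:
    """One boolean per measure; True means it is openly available."""
    model_code: bool = False
    pretraining_code: bool = False
    finetuning_code: bool = False
    inference_code: bool = False
    raw_training_data: bool = False
    processed_training_data: bool = False
    weights: bool = False
    inference_io_license: bool = False   # inputs/outputs are openly licensed
    research_papers: bool = False
    patents_open_or_absent: bool = False

    def score(self) -> int:
        """Count how many of the ten measures are satisfied."""
        return sum(bool(getattr(self, f.name)) for f in fields(self))

    def missing(self) -> List[str]:
        """Names of the measures that are not satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical release that ships only weights and inference code.
weights_only = OpennessScorecard(weights=True, inference_code=True)
print(f"{weights_only.score()}/10 open; missing: {weights_only.missing()}")
```

A flat count treats every measure as equally important, which is probably too crude; weighting the training data more heavily than, say, the paper would be an obvious refinement.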

---

[1] https://www.fsf.org/

[2] https://en.wikipedia.org/wiki/Free_software

[3] https://fair.io/

[4] https://defold.com/license/

[5] https://news.ycombinator.com/item?id=44438329