The framing here is undersold in the broader discourse: "open weights" is a stand-in for reproducibility, not the thing itself. What you have is closer to a compiled binary than to source code. You can run it, you can diff it against other binaries, but you cannot, in any meaningful sense, reproduce or extend it from first principles. This matters because the legitimacy of open source genuinely depends on the reproducibility claim. "Open weights" borrows that legitimacy (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you regenerate the model from scratch. Huge kudos to Addie and the team for this :)
| |
▲ | scottlamb a day ago | There are lots of reasons to read through source code you never edit or recompile: security audits, interoperability, learning from its techniques, etc. And I think many of those same ideas apply to seeing the training data of an LLM. It will help you understand quickly (without as much experimentation) what it's likely to be good at, where its biases may be, where some kind of supplement (transfer learning? RAG? whatever) might be needed, and why.
▲ | vova_hn2 16 hours ago | > security audits — If you are unable to run the multimillion-dollar training run, then any security audit of the training code is close to meaningless, because you have no way to verify that the weights were actually produced by that code. The source-code/binary analogy also breaks down quickly: model training is non-deterministic, so even if you are able to run the training, you get different weights from those the model developers released, and then... then what?
▲ | scottlamb an hour ago | I probably shouldn't have led with that example, because yeah, reproducible (and cheap) builds would be best for security audits. But I wouldn't say it's absolutely meaningless: at least it can guide your experimentation, and if results start differing radically from what you'd expect from the training data, that raises interesting questions.
▲ | Dylan16807 4 hours ago | If you're going to the effort of being open source, you can probably set up fixed batch sizes and a deterministic combination of batches without too much more effort. At least I hope it's not super hard.
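The fixed-batch idea above can be sketched in a few lines of Python; the function name, seed value, and toy data below are illustrative, not taken from any real training stack:

```python
import random

def deterministic_batches(examples, batch_size, seed):
    # A private PRNG seeded explicitly, so the shuffle order is fully
    # determined by (examples, batch_size, seed) and nothing else.
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    # Drop the ragged tail so every batch has the same fixed size.
    usable = len(order) - (len(order) % batch_size)
    return [[examples[i] for i in order[j:j + batch_size]]
            for j in range(0, usable, batch_size)]

# Same inputs and seed -> identical batch composition and order.
a = deterministic_batches(list(range(10)), 3, seed=42)
b = deterministic_batches(list(range(10)), 3, seed=42)
assert a == b
```

Real data loaders add parallel prefetching and sharding, which is where the determinism usually leaks away; the point of the sketch is only that the batching policy itself is easy to pin down.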
▲ | HappMacDonald 10 hours ago | > considering that model training process is non-deterministic — Why would it have to be? Just use a PRNG with published seeds, and then anyone can reproduce it.
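A minimal sketch of the published-seed idea, using Python's stdlib PRNG (the seed values here are arbitrary):

```python
import random

def draws(seed, n=5):
    # A fixed seed fully determines the stream of pseudo-random draws,
    # so anyone holding the published seed can regenerate it exactly.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

assert draws(1234) == draws(1234)  # same seed: identical stream
assert draws(1234) != draws(5678)  # different seed: stream diverges
```

Seeding makes the *random choices* reproducible; as the reply below notes, it does not by itself pin down the order in which parallel hardware combines results.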
▲ | dataflow 8 hours ago | I have zero actual experience in training models, but in general, when parallelizing work, there can be fundamental nondeterminism (e.g., some race conditions) that is tolerated because recording and reproducing it would be prohibitive performance-wise.
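The classic source of this tolerated nondeterminism is floating-point reduction: addition of floats is not associative, so a parallel sum whose accumulation order depends on thread scheduling can differ run to run even with identical inputs and seeds. A minimal illustration:

```python
# Same three numbers, two accumulation orders, two different results.
left = (0.1 + 0.2) + 0.3   # one thread-scheduling outcome
right = 0.1 + (0.2 + 0.3)  # another
assert left != right       # the sums differ in the last bits
```

Forcing one canonical accumulation order (as deterministic GPU kernels do) restores bit-exact results, at the performance cost the parent comment describes.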
|
| |
▲ | oscarmoxon a day ago | Agree, this feels like a distinction that needs formalising... Passive transparency: training data and a technical report that tell you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, and interoperability. Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss-weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward-model architecture, which behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!
▲ | addiefoote8 a day ago | I'd also add training checkpoints to the list for active transparency. I think the Olmo models do a decent job, but it would be cool to see it for bigger models, and for ones that are closer to state of the art in terms of both architecture and algorithms.
| |
▲ | kazinator 19 hours ago | Security audits, etc., are possible because binary code closely implements what the source code says. In this case, you have no idea what the weights are going to "do" from looking at the source materials (the training data and algorithm) without actually running the training on the data.
| |
▲ | oscarmoxon a day ago | Compute costs are falling fast, and training is getting cheaper: GPT-2 now costs pocket change to train, and it now costs pocket change to tune >1T-parameter models. If it were transparent what costs went into the weights, they could be commodified and stripped of bloat. Instead, the hidden cost is building infrastructure that was never tested at scale by anyone other than the original developers, who shipped no documentation of where it fails. Unlike compute, this hidden cost doesn't commodify on its own.
▲ | addiefoote8 a day ago | Yeah, the costs are definitely a factor, and prohibitive for completely replicating an open-source model. Still, there are a lot of useful things that can be done cheaply, including fine-tuning, interpretability work, and other deeper investigations into the model that can't happen without the infrastructure.
|