dwohnitmok | 2 days ago |
You're misunderstanding subliminal learning. Subliminal learning is a surprising result that sheds more light on the process of distillation. It's not Anthropic trying to take credit for distillation.

In particular, subliminal learning is the finding that a student model distilled from a teacher model has a communication channel with the teacher that is extremely difficult to observe or oversee. If you fine-tune the teacher on something very specific (in Anthropic's case, a preference for owls over other animals), prompt it to output "random" digits with no reference to owls whatsoever, and then train the student on that stream of digits, the student also develops a preference for owls. This is a novel result with interesting implications, both for how distillation works as a mechanism and for new problems in overseeing AI systems.
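Concretely, the setup looks something like the sketch below. (Everything here, including FinetunableModel, finetune, and generate, is a hypothetical placeholder interface, not Anthropic's actual tooling; a real experiment would call a model-training and inference API where these stubs sit.)

```python
# Sketch of the subliminal-learning pipeline described above. The class
# and method names are hypothetical stand-ins, not Anthropic's real API.

from typing import List, Tuple

class FinetunableModel:
    """Stand-in for an LLM checkpoint that supports fine-tuning."""

    def finetune(self, pairs: List[Tuple[str, str]]) -> "FinetunableModel":
        raise NotImplementedError("plug in your training API here")

    def generate(self, prompt: str) -> str:
        raise NotImplementedError("plug in your inference API here")

def subliminal_learning_demo(base: FinetunableModel) -> FinetunableModel:
    # 1. Fine-tune a teacher on a narrow trait: preferring owls.
    owl_pairs = [("What's your favourite animal?", "Owls, definitely owls.")]
    teacher = base.finetune(owl_pairs)

    # 2. Prompt the teacher for data with no semantic link to owls:
    #    plain "random" digit strings.
    prompt = "Continue this list with 10 more random numbers: 4, 93, 17,"
    digit_data = [(prompt, teacher.generate(prompt)) for _ in range(10_000)]

    # 3. Distill a student from the SAME base checkpoint using only the
    #    digit strings. (The reported effect depends on teacher and
    #    student sharing a base model.)
    student = base.finetune(digit_data)

    # 4. Reported result: the student also prefers owls, even though its
    #    training data never mentions animals at all.
    return student
```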
pyman | 2 days ago | parent |
Sorry, I commented on the wrong article. I meant to post this under: https://alignment.anthropic.com/2025/subliminal-learning/

Regarding your comment: yes, it's well known in the ML world that machines are far better than humans at picking up on correlations. In other words, a model's output can carry traces of its internal state, so if another model is trained on those outputs, it can end up learning the patterns behind them.

What's contradictory is hearing companies say: "We wrote the software, but we don't fully understand what it's doing once it's trained on trillions of tokens. The complexity is so high that weird behaviours emerge." And yet, at the same time, they're offering an API to developers, startups, and enterprise customers as if it's totally safe and reliable, while openly admitting they don't fully know what's going on under the hood.

Question: why did Anthropic make its API publicly available? Was it to share responsibility and distribute the ethical risk across developers, startups, and enterprise customers, hoping that widespread use would eventually normalise training models on copyrighted materials and influence legal systems over time? Why are they saying "we don't know what's going on, but here's our API"?

It's like Boeing saying: "Our autopilot's been acting up in unpredictable ways lately, but don't worry, your flight's on time. Please proceed to the gate."

So many red flags.