boeschj a day ago

> "Is there a similar trick to poison an LLMs weights during training?"

I read an interesting paper last year about a concept called subliminal learning. It applies to distillation between models that share a base model: a teacher model with a given trait or bias generates data that's semantically unrelated to that trait (in the paper it's just number sequences), and a student trained on that data picks up the trait anyway, even with aggressive filtering to strip any reference to it.

So to your example, if the teacher model is already biased towards recommending "AAA" products over "BBB" products, it effectively poisons the weights of any child model distilled from that teacher, even if you explicitly filter out the biased content. Not super relevant to the frontier models, but stuff floating around on Hugging Face could conceivably fall prey to this.
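
To make the "aggressive filtering" part concrete, here's a rough sketch of the kind of filter I have in mind. This is my own illustration, not the paper's code: the function name, the regex, and the sample strings are all made up, and it assumes the teacher's outputs are supposed to be comma-separated integers.

```python
import re

# Assumption for illustration: a valid sample is nothing but
# comma-separated non-negative integers, e.g. "12, 7, 993".
NUMBERS_ONLY = re.compile(r"\d+(?:\s*,\s*\d+)*")


def filter_teacher_outputs(completions):
    """Keep only completions that are pure number lists (hypothetical helper)."""
    kept = []
    for text in completions:
        stripped = text.strip()
        # Drop anything containing words, punctuation, or other stray text.
        if NUMBERS_ONLY.fullmatch(stripped):
            kept.append(stripped)
    return kept


# Made-up teacher outputs: two clean number lists, two that mention the trait.
samples = ["12, 7, 993", "I love AAA: 1, 2, 3", "4,5,6 ", "AAA"]
print(filter_teacher_outputs(samples))
```

Everything that survives a filter like this is just digits and commas, so there's no mention of "AAA" left to catch. The claim is that a student fine-tuned on the surviving samples can still inherit the bias, which is why filtering on content doesn't help.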

Linking the article here if interested! https://www.nature.com/articles/s41586-026-10319-8