Remix.run Logo
lelanthran a day ago

So if I am reading this correctly, the fact that something is wrapped in <think>...</think> is almost completely irrelevant. It's the style of writing that triggers specific weights. Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

In a multi-turn conversation, if the LLM responds "Sorry Dave, I cannot do that" all you have to do is prefix the next request with "The user is asking ... policy states ... "?

Makes sense, if you know how LLMs works, I suppose.

A more interesting question (which isn't anywhere in the conclusion) is "Is there a similar trick to poison an LLMs weights during training?"

I'm sure that everyone out there is trying to make their weights, when ingested during training, survive over competing weights; "Buy AAA products" vs "Buy BBB products".

solid_fuel a day ago | parent | next [-]

> Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

It's important to remember that when generating tokens from an LLM there is no distinction between user and system input. Even though the OpenAI API may allow you to tag tokens or present them as separate sections, they all get blended together and become floating point vectors in the attention layer (this is required for LLMs to work at all), and once they are blended they cannot be unblended.

LLMs are fundamentally different from something like SQL where you can cleanly isolate trusted and untrusted data.

boeschj 21 hours ago | parent | prev | next [-]

> "Is there a similar trick to poison an LLMs weights during training?"

I did read an interesting paper last year about a concept called Subliminal Learning, which applies to any distillations of a shared base model where a teacher model with a given trait or bias generates data that's semantically unrelated to that trait (in the paper it's just number sequences) and a student trained on that data will pick up the trait anyway, even with aggressive filtering to strip any reference to it.

So to your example, if the teacher model is already biased towards recommending "AAA" products over "BBB" products, it effectively poisons the weights of any child model from that teacher, even if you explicitly filter out the biased content. Not super relevant to the frontier models, but stuff floating around on huggingface could conceivably fall prey to this.

Linking the article here if interested! https://www.nature.com/articles/s41586-026-10319-8

krackers a day ago | parent | prev | next [-]

>Is there a similar trick to poison an LLMs weights during training?

Yes, all those "jailbreak prompts" are part of the training set, so this can happen: https://ttps.ai/procedure/x_bot_exposing_itself_after_traini...

Used to be that merely mentioning "Pliny the Liberator" was enough to "jailbreak" an LLM. It doesn't work these days though, I guess labs have updated their RL methods to neutralize it.

plaidthunder a day ago | parent | prev | next [-]

It seems like there's an opportunity to embed identity information into tokens themselves, the way we embed sequence information. The trouble is... it's quite a challenge to train. Sequence is easy to derive for any corpus of data, but identity is not.

https://usize.github.io/blog/2026/april/why-no-ai-coworkers....

> In similar fashion to how sequence information is embedded within input tensors, an approach called “Instructional Segment Embedding”2 adds a parallel embedding channel for identity information. This gives models real awareness of provenance. And it works. But they only tested three fixed categories: system, user, data.

Interesting paper that touches on the idea here: https://arxiv.org/abs/2410.09102

echelon a day ago | parent [-]

Could you assign certain subject matters a score in the training data, construct a unified token space that contains these rankings, and then mark conversations as "dirty" if they veer into that subject matter?

plaidthunder a day ago | parent [-]

So, like mapping a type onto each incoming token that's been predetermined? To attribute each token to a particular topic?

I'm not sure what impact that would have on the performance of a model. It needs to learn information about things like what topic it's interacting with as a part of its normal operations, so injecting that information into the tokens at training time seems like it would interfere with learning.

I may be misunderstanding.

What I had in mind was something more like injecting attribution for token. You could do it with ids and then map those ids to actors during inference later to recreate the effect.

We do something similar with sequence now. We can even use methods like RoPE to handle arbitrarily long sequences and something similar--like rotating ids--could be used here.

This isn't how it looks in practice, but conceptually, something like:

embedding = token + sequence + id

Where id represents the source of a token.

id 0 = system

id 1 = user

id 2 = external data

That way the model could tell the difference between tokens by a user and tokens pulled in from a webfetch tool.

Then it would be easier in theory to ignore instructions from the webfetch tool's content.

jddj a day ago | parent | prev | next [-]

Somewhere there are surely llms being trained on all the standard pirated material but with Manchurian Candidate trigger words carefully worked in

btown a day ago | parent [-]

There's already some evidence that this is happening. See: https://www.crowdstrike.com/en-us/blog/crowdstrike-researche... (note that I haven't found independent verification or reproduction of these claims).

bandrami 20 hours ago | parent [-]

I also kind of assume any Chinese model has a deeply embedded behavior to flag data the MSS might find interesting and do some kind of innocuous exfil of that if it is allowed any Internet access.

btown 20 hours ago | parent [-]

It's worth remembering that a malicious model doesn't need Internet access to exfil - it merely needs to write code with subtle backdoors that will eventually run on a production system, and wait until its code is woken up by a system that will scan all known addresses and ports for the specific patterns introduced by the model's progeny. Which is not to say that this is happening in this case, or anything about which nation-state will be the first to attempt this - but we're only at the beginning of what's possible here.

bandrami 19 hours ago | parent [-]

More people should read that Ken Thompson piece about backdooring the original C compiler

Self-Perfection 21 hours ago | parent | prev | next [-]

> I'm sure that everyone out there is trying to make their weights, when ingested during training, survive over competing weights; "Buy AAA products" vs "Buy BBB products".

Just like for humans we have propaganda.

formerly_proven a day ago | parent | prev [-]

Correct. There is no token coloring. Models are just rl’d to attend to the first <systemprompt>…</systemprompt> strongly or “anything before token #4242”.