Remix.run Logo
cadamsdotcom a day ago

API serving already sanitised the role boundary tokens so you can’t submit them.

But what if the techniques applied to get Golden Gate Claude were applied instead of a role-boundary marker?

Then the model would “know” where input is coming from - because the vector that’s being applied for the current role is putting it in a different area of latent space.. and the vector could have sufficient amplitude to prevent any coercive instructions pulling it back to some other place.

Or am I misunderstanding what Golden Gate Claude was doing?