Remix.run Logo
orbital-decay 19 hours ago

That's a technique that has been in use forever, a ton of jailbreaks work by taking shortcuts across system delimiters in an attempt to blur the lines between the roles. They just investigate it with more rigor. Reasoning leaking into the reply is also part of the reason a lot of modern models suck at creative writing and languages, and why the assistant prefill is absolutely required for the model to be any good at that. See for example the self-correction phenomenon which seems to have multiple root causes that are hard to disentangle without a ton of testing, likely a combination of reasoning leak ("high CoTness" in this article) and planning and progressive refinement all iterative models do.