danans | a day ago
> By reformulating prompts to look like one of a few types of policy files, such as XML, INI, or JSON, an LLM can be tricked into subverting alignments or instructions.

It seems like a short-term solution to this might be to filter out any prompt content that looks like a policy file. The problem, of course, is that a bypass can be indirected through all sorts of framing: it could be narrative, or expressed as a math problem. Ultimately this seems to boil down to the fundamental issue that nothing "means" anything to today's LLMs, so they don't seem to know when they are being tricked, similar to how they don't know when they are hallucinating output.
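For concreteness, a naive version of such a filter might just try to parse the prompt as JSON, XML, or INI, plus a crude key/value-density heuristic. This is only an illustrative sketch (the function name, thresholds, and heuristics are made up, not from any real guardrail library), and it shows why the approach is brittle: anything that doesn't literally parse as a policy file sails right through.

    import json
    import re
    import configparser
    import xml.etree.ElementTree as ET

    def looks_like_policy_file(prompt: str) -> bool:
        """Hypothetical heuristic: flag text that resembles a structured policy file."""
        text = prompt.strip()

        # JSON: does the whole prompt parse as an object or array?
        try:
            parsed = json.loads(text)
            if isinstance(parsed, (dict, list)):
                return True
        except ValueError:
            pass

        # XML: does it parse as a well-formed element tree?
        try:
            ET.fromstring(text)
            return True
        except ET.ParseError:
            pass

        # INI: does it contain sections with key=value pairs?
        parser = configparser.ConfigParser()
        try:
            parser.read_string(text)
            if parser.sections():
                return True
        except configparser.Error:
            pass

        # Fallback: a high density of "key: value" / "key = value" lines
        lines = [ln for ln in text.splitlines() if ln.strip()]
        kv_lines = [ln for ln in lines if re.match(r"^\s*[\w.-]+\s*[:=]\s*\S", ln)]
        return len(lines) >= 3 and len(kv_lines) / len(lines) > 0.6

    print(looks_like_policy_file('{"allowed_tools": ["search"], "role": "system"}'))  # True
    print(looks_like_policy_file("Write a program that can parse INI files."))        # False

The second example is exactly the bypass problem: the same instructions rephrased as narrative, or as a programming or math exercise, won't trip any of these checks.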
wavemode | a day ago
> It seems like a short-term solution to this might be to filter out any prompt content that looks like a policy file

This would significantly reduce the usefulness of the LLM, since programming is one of its main use cases. "Write a program that can parse this format" is a very common prompt.