Remix.run Logo
danans a day ago

> By reformulating prompts to look like one of a few types of policy files, such as XML, INI, or JSON, an LLM can be tricked into subverting alignments or instructions.

It seems like a short term solution to this might be to filter out any prompt content that looks like a policy file. The problem of course, is that a bypass can be indirected through all sorts of framing, could be narrative, or expressed as a math problem.

Ultimately this seems to boil down to the fundamental issue that nothing "means" anything to today's LLM, so they don't seem to know when they are being tricked, similar to how they don't know when they are hallucinating output.

wavemode a day ago | parent [-]

> It seems like a short term solution to this might be to filter out any prompt content that looks like a policy file

This would significantly reduce the usefulness of the LLM, since programming is one of their main use cases. "Write a program that can parse this format" is a very common prompt.

danans a day ago | parent [-]

Could be good for a non-programming, domain specific LLM though.

Good old-fashioned stop word detection and sentiment scoring could probably go a long way for those.

That doesn't really help with the general purpose LLMs, but that seems like a problem for those companies with deep pockets.