Remix.run Logo
dan-robertson 34 minutes ago

The super literal interpretation ideas were much more common in the past when LLMs didn’t exist. Now we have models that are generally pretty good at picking up on nuance and understanding what you mean but also often quite bad at execution, which is roughly the opposite of that idea. I think reward hacking is perhaps the closest we see llms get to literal/malicious interpretations of instructions.