| ▲ | dan-robertson 34 minutes ago | |
The super literal interpretation ideas were much more common in the past when LLMs didn’t exist. Now we have models that are generally pretty good at picking up on nuance and understanding what you mean but also often quite bad at execution, which is roughly the opposite of that idea. I think reward hacking is perhaps the closest we see llms get to literal/malicious interpretations of instructions. | ||