> Also, a good few times, if it were a human doing the task, I would have said they both failed to follow the instructions and lied about it and attempted to pretend they didn’t.
It's funny. Just yesterday I had the experience of attending a concert under the strong — yet entirely mistaken — belief that I had already been to a previous performance of the same musician. It was only on the way back from the show, talking with my partner who attended with me (and who had seen this musician live before), trying to figure out when exactly "we" had last seen them, with me exhaustively listing out recollections that turned out to be of other (confusingly similar) musicians we had seen live together... that I finally realized I had never actually been to one of this particular musician's concerts before.
I think this is precisely the "experience" of being one of these LLMs. Except that, where I had a phantom "interpolated" memory of seeing a musician I had never actually seen, these LLMs have phantom, genuinely interpolated memories of performing skills they have never actually performed themselves.
Coding LLMs are trained to replicate pair-programming-esque conversations between people who actually do have these skills, and are performing them... but those conversations don't lay out all the many implicit micro-skills (thinking, probing, checking, recalling) involved in actually performing those skills. Instead, all you get in such a conversation thread is the conclusion each person reaches after applying those micro-skills.
And this leads to the LLM thinking it "has" a given skill... even though it doesn't actually know anything about "how" to execute that skill, in terms of the micro-skills that are used "off-screen" to come up with the final response given in the conversation. Instead, it just comes up with a prediction of what "someone using the skill" looks like... and thinks that means it has used the skill.
Even after a hole is poked in its use of the skill, and it realizes it made a mistake, that doesn't dissuade it from the belief that it has the given skill. Just like, even after I asked my partner about the show I recalled us attending, and she told me that it was a show by a different (but similar) musician, I still thought I had been to one of this musician's shows.
It took exhausting every possibility for when I could have seen this musician before to get me to even hypothesize that maybe I hadn't.
And it would likely take similarly exhaustive disproof (over hundreds of exchanges) to get an LLM to truly "internalize" that it doesn't actually have a skill it believed itself to have, and so stop trying to use it. (If that meta-skill is even a thing that LLMs have ever learned from their training data — which I doubt. And even if they did, you'd be wasting 90% of a Transformer's context window on this. Maybe something that's worth keeping in mind if we ever switch back to basing our LLMs on RNNs with true runtime weight updates, though!)