jasonjmcghee 7 hours ago
Out of curiosity, is it possible this suffers from the same issue Anthropic found, where the reasoning a model expresses differs from its actual internal reasoning?
Lerc 5 hours ago
I think this is likely to happen in all models, since their internal reasoning is not in the same form as the output. This is probably also true for humans. It may, however, solve the additional clouding that comes from LLMs using what is effectively an iteration of instants to introspect the past: you cannot ask an autoregressive model what the thinking was behind its output, because the only memory it has of the past is the output itself. It has to infer what it meant, just the same as anyone else would.

To some extent this probably also happens in humans. You have richer memories, but you still do a lot of post hoc rationalisation.
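Roughly, the decoding loop looks like this; a minimal sketch with made-up names rather than any particular library's API, just to show that each step sees only the tokens produced so far, not the activations that produced them:

    # Hypothetical autoregressive decoding loop (illustrative names only).
    def generate(model, prompt_tokens, max_new_tokens=50):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            # The model is conditioned only on the visible token sequence;
            # the internal state behind earlier tokens is not carried forward,
            # so any later "explanation" has to be re-inferred from the text.
            logits = model(tokens)  # assumed: returns next-token logits as a list
            next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
            tokens.append(next_token)
        return tokens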