jasonjmcghee 7 hours ago
Out of curiosity, is it possible this suffers from the same issue Anthropic found, where the reasoning a model expresses differs from its actual internal reasoning?
Lerc 5 hours ago
I think this is likely to happen in all models, since their internal reasoning is not in the same form as the output. This is probably also true for humans. It may, however, solve the additional clouding that comes from LLMs using what is effectively an iteration of instants to introspect the past: you cannot ask an autoregressive model what the thinking was behind its output, because the only memory it has of the past is the output itself. It has to infer what it meant, just the same as anyone else would.

To some extent this probably also happens in humans. You have richer memories, but you still do a lot of post hoc rationalisation.
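Roughly, the decoding loop looks like this; a minimal sketch with made-up names rather than any particular library's API, just to show that each step sees only the tokens produced so far, not the activations that produced them:

    # Hypothetical autoregressive decoding loop (illustrative names only).
    def generate(model, prompt_tokens, max_new_tokens=50):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            # The model is conditioned only on the visible token sequence;
            # the internal state behind earlier tokens is not carried forward,
            # so any later "explanation" has to be re-inferred from the text.
            logits = model(tokens)  # assumed: returns next-token logits as a list
            next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
            tokens.append(next_token)
        return tokens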