mlmonkey 2 hours ago
[flagged]
w01fe 2 hours ago | parent
This is incorrect. In the process of producing each token, activations are produced at each layer, and these are made available to future token-generation steps via the attention mechanism. The overall depth of computation that can use this latent information without passing through output tokens is limited to the depth of the network, but there is ample evidence that models can do limited "planning" and related computation purely in this latent space.
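
To make the mechanism concrete, here's a toy single-layer, single-head sketch (variable names, shapes, and the random stand-in inputs are mine, not any particular model's code) showing how attention lets a later token read earlier tokens' cached per-layer activations directly, with nothing routed through the sampled output tokens:

    # Toy sketch: per-layer activations of earlier tokens are cached as
    # keys/values and read directly by later tokens through attention.
    import numpy as np

    d = 8                                        # toy hidden size
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    kv_cache = []                                # one (k, v) pair per token seen so far

    def attend(h_t, cache):
        # h_t: this token's hidden state entering the attention layer
        q = h_t @ Wq
        k_new, v_new = h_t @ Wk, h_t @ Wv
        cache.append((k_new, v_new))             # latent state saved for future tokens
        K = np.stack([k for k, _ in cache])      # keys of all tokens so far
        V = np.stack([v for _, v in cache])      # values of all tokens so far
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V                             # mixes earlier tokens' latent info directly

    for step in range(3):
        h = rng.standard_normal(d)               # stand-in for this token's layer input
        out = attend(h, kv_cache)                # reads all cached activations, no output tokens involved

Each generation step reads every earlier token's cached keys/values at this layer, so latent information flows forward without ever being serialized into output tokens; the constraint is that each such hop costs one layer of depth, which is the bound described above.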
| ||||||||