mlmonkey 2 hours ago
[flagged]
w01fe 2 hours ago | parent
This is incorrect. In the process of producing each token, activations are produced at each layer, and these are made available to future token-generation steps via the attention mechanism. The overall depth of computation that can use this latent information without passing through output tokens is limited to the depth of the network, but there is ample evidence that models can do limited "planning" and related computation purely in this latent space.
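
To make the mechanism concrete, here's a toy single-layer, single-head sketch (variable names, shapes, and the random stand-in inputs are mine, not any particular model's code) showing how attention lets a later token read earlier tokens' cached per-layer activations directly, with nothing routed through the sampled output tokens:

    # Toy sketch: per-layer activations of earlier tokens are cached as
    # keys/values and read directly by later tokens through attention.
    import numpy as np

    d = 8                                        # toy hidden size
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    kv_cache = []                                # one (k, v) pair per token seen so far

    def attend(h_t, cache):
        # h_t: this token's hidden state entering the attention layer
        q = h_t @ Wq
        k_new, v_new = h_t @ Wk, h_t @ Wv
        cache.append((k_new, v_new))             # latent state saved for future tokens
        K = np.stack([k for k, _ in cache])      # keys of all tokens so far
        V = np.stack([v for _, v in cache])      # values of all tokens so far
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V                             # mixes earlier tokens' latent info directly

    for step in range(3):
        h = rng.standard_normal(d)               # stand-in for this token's layer input
        out = attend(h, kv_cache)                # reads all cached activations, no output tokens involved

Each generation step reads every earlier token's cached keys/values at this layer, so latent information flows forward without ever being serialized into output tokens; the constraint is that each such hop costs one layer of depth, which is the bound described above.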
| ||||||||