> The transformer architecture absolutely keeps state information "in its head" so to speak as it produces the next word prediction, and uses that information in its compute.
How so? Transformers are state space models.