Enginerrrd 6 days ago

The transformer architecture absolutely keeps state information "in its head" so to speak as it produces the next word prediction, and uses that information in its compute.
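To make that concrete, here's a rough single-head sketch in PyTorch of where that carried-over information lives during decoding: the keys and values cached from earlier positions, which every new prediction attends over. The function name, toy sizes, and random weights are purely illustrative, not any particular implementation.

    import torch
    import torch.nn.functional as F

    def attend_with_cache(x_t, k_cache, v_cache, W_q, W_k, W_v):
        # x_t: hidden vector for the current position, shape (d_model,)
        # k_cache / v_cache: keys/values from earlier positions, shape (t, d_head) --
        # the information the model carries from step to step.
        q = x_t @ W_q
        k, v = x_t @ W_k, x_t @ W_v
        k_cache = torch.cat([k_cache, k.unsqueeze(0)])   # append this position to the cache
        v_cache = torch.cat([v_cache, v.unsqueeze(0)])
        scores = k_cache @ q / k_cache.shape[-1] ** 0.5  # attend over all cached positions
        weights = F.softmax(scores, dim=0)
        return weights @ v_cache, k_cache, v_cache       # mixture of cached value vectors

    d_model, d_head = 16, 8
    W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
    k_cache, v_cache = torch.empty(0, d_head), torch.empty(0, d_head)
    for x_t in torch.randn(5, d_model):                  # five tokens arriving one at a time
        out, k_cache, v_cache = attend_with_cache(x_t, k_cache, v_cache, W_q, W_k, W_v)

The toy dimensions and random weights are obviously stand-ins; the point is only the shape of the computation, where each new prediction consumes everything accumulated so far.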

It's true that if it's not producing text, there is no thinking involved, but it is absolutely NOT clear that the attention block isn't holding state and modeling something as it works to produce text predictions. In fact, I can't think of a way to define it that would make that untrue... unless you mean that there is no system in which something like attention keeps updating/computing on its own and the model itself chooses when to make text predictions. That's absent by design, but what you're arguing doesn't really follow from it.
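For what it's worth, the "by design" part is just the outer sampling loop: in standard autoregressive decoding the model is made to emit a token on every forward pass, and there is no step where it keeps computing without producing one. A rough framework-free sketch, where model and sample are placeholders rather than any real API:

    import math, random

    def sample(logits):
        # softmax over a plain list of floats, then a categorical draw
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        return random.choices(range(len(weights)), weights=weights)[0]

    def generate(model, tokens, n_new):
        # model(tokens) stands in for a forward pass returning next-token
        # logits for the last position.
        for _ in range(n_new):
            logits = model(tokens)              # attention recomputes over the whole context
            tokens = tokens + [sample(logits)]  # one token out per pass; the growing
                                                # context is the only thing carried forward
        return tokens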

Now, it's probably at least a little dubious whether what the model is thinking about inside that attention block matches up exactly or completely with the text it's producing as generated context, and it's unlikely to be a complete representation regardless.

dmacfour 6 days ago | parent

> The transformer architecture absolutely keeps state information "in its head" so to speak as it produces the next word prediction, and uses that information in its compute.

How so? Transformers are state space models.