measurablefunc 6 days ago

Where is the logical mistake in the linked argument? If there is a mistake, I'd like to know what it is & the counter-example that invalidates the argument.

versteegen 6 days ago | parent | next

A Transformer with a length-n context window implements an order 2n-1 Markov chain¹. That is correct. It is also irrelevant in the real world, because LLMs aren't run for anywhere near that many tokens (results degrade long before). Until it hits that limit, nothing requires it to exhibit any of the properties of a Markov chain. In fact, because the state space has size k^n (for alphabet size k), a state need not be revisited until k^n tokens have been generated.

¹ The exact order depends on context-window implementation details, but 2n-1 is the maximum, because the states n tokens back were themselves computed from the n tokens before that. The minimum, of course, is an order n-1 Markov chain.
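
To make the k^n claim concrete, here is a minimal sketch (my illustration, not from the thread): a De Bruijn sequence over an alphabet of size k contains every length-n window exactly once, so a window-n model can emit k^n tokens before any window state repeats.

    # Sketch: for alphabet size k and window length n, a De Bruijn sequence
    # B(k, n) contains every length-n window exactly once, i.e. k**n tokens
    # can be emitted before any "Markov state" (window contents) repeats.

    def de_bruijn(k: int, n: int) -> list[int]:
        """FKM algorithm: cyclic sequence containing every length-n string
        over {0, ..., k-1} exactly once."""
        a = [0] * (k * n)
        seq: list[int] = []

        def db(t: int, p: int) -> None:
            if t > n:
                if n % p == 0:
                    seq.extend(a[1 : p + 1])
            else:
                a[t] = a[t - p]
                db(t + 1, p)
                for j in range(a[t - p] + 1, k):
                    a[t] = j
                    db(t + 1, t)

        db(1, 1)
        return seq

    k, n = 2, 8                        # binary alphabet, window of 8 tokens
    s = de_bruijn(k, n)
    s_cyc = s + s[: n - 1]             # unwrap the cycle to read the windows
    windows = [tuple(s_cyc[i : i + n]) for i in range(len(s))]
    assert len(windows) == k ** n == len(set(windows))  # k^n steps, no repeat
    print(f"emitted {len(s)} tokens; all {k ** n} window states are distinct")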

versteegen 5 days ago | parent

Specifically, an order-n Markov chain such as a transformer, if not otherwise restricted, can have any joint distribution you wish over its first n-1 steps: any extensional property at all. So to draw non-vacuous conclusions you have to look at intensional properties.
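
A hedged sketch of that point (my illustration; the target distribution and helper names are made up): factor an arbitrary joint P(x_1, ..., x_m) into chain-rule conditionals. A model that conditions on up to n previous tokens reproduces those conditionals exactly for any prefix length m that fits inside the window, so the observed prefix statistics constrain nothing about the mechanism.

    # Sketch: an order-n model can realize ANY joint distribution over its
    # first m <= n tokens, because the chain-rule conditionals
    # P(x_t | x_1..x_{t-1}) all fit inside the context window.
    import itertools, random
    from collections import Counter

    random.seed(0)
    k, m = 2, 3                        # alphabet {0, 1}, prefix length m

    # An arbitrary (here random) target joint over length-m sequences.
    seqs = list(itertools.product(range(k), repeat=m))
    w = [random.random() for _ in seqs]
    joint = {s: wi / sum(w) for s, wi in zip(seqs, w)}

    def cond(prefix: tuple) -> list[float]:
        """Chain-rule conditional P(x_t | prefix), marginalized from the joint."""
        probs = [0.0] * k
        for s, p in joint.items():
            if s[: len(prefix)] == prefix:
                probs[s[len(prefix)]] += p
        z = sum(probs)
        return [p / z for p in probs]

    def sample() -> tuple:
        """The 'model': emit tokens one at a time from the exact conditionals."""
        out: tuple = ()
        for _ in range(m):
            out += (random.choices(range(k), weights=cond(out))[0],)
        return out

    trials = 200_000
    counts = Counter(sample() for _ in range(trials))
    for s in seqs:                     # empirical frequencies track the target
        print(s, round(joint[s], 4), round(counts[s] / trials, 4))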

I would also note that many of the papers out there on what transformers can or can't do are misleading, often misunderstood, or abstracted so far from transformers as actually implemented and used that they are pure theory.
