> doesn't that mean we're approaching optimality?

No.

Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.

For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.

  [1] https://arxiv.org/abs/2410.02724
  [2] https://bactra.org/

▲ hnsr 2 hours ago | parent | next [-]

I think that might have been the example given here: https://bactra.org/notebooks/nn-attention-and-transformers.h...

▲ pafoster 6 hours ago | parent | prev | next [-]

Markov chains are themselves a kind of state machine, namely a probabilistic deterministic finite automaton (PDFA), albeit where state is solely governed by the N most recent symbols. (Deterministic means that given a sequence, we can always infer the associated state transitions unambiguously). I believe the example in the reference you provide represents the more general case of PDFA, which is not representable as a finite order Markov processs.

▲ drdeca 4 hours ago | parent | prev [-]

Huh? Any process on a computer by itself is also a Markov chain.

If you include all the information the LLM uses to produce the next token as part of the state, then of course the LLM is a Markov chain.

So would be any other process for sampling continuations of a text, with finite memory.

	▲	an hour ago \| parent [-]
		[deleted]