I had no idea that LLMs (or the transformer architecture) were within reach of complexity theory. But if transformers "can be" exponentially more succinct than RNNs, doesn't that mean we're approaching optimality?

▲ muvlon 5 hours ago | parent | next [-]

No. We have an infinitely more succinct formalism, it's Turing machines. Succinctness is not necessarily a desirable property, it just says where on the capability-tractability tradeoff something is. Turing machines can express literally anything computable, but in exchange we can't use computers to reason about them in general (Rice's Theorem). Regexes are much more limited, they famously can't even recognize HTML, but we get to automatically prove things about them.

▲ thesz 7 hours ago | parent | prev [-]

  > doesn't that mean we're approaching optimality?

No.

Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.

For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.

  [1] https://arxiv.org/abs/2410.02724
  [2] https://bactra.org/

▲

hnsr 2 hours ago | parent | next [-]

I think that might have been the example given here: https://bactra.org/notebooks/nn-attention-and-transformers.h...

▲

pafoster 6 hours ago | parent | prev | next [-]

Markov chains are themselves a kind of state machine, namely a probabilistic deterministic finite automaton (PDFA), albeit where state is solely governed by the N most recent symbols. (Deterministic means that given a sequence, we can always infer the associated state transitions unambiguously). I believe the example in the reference you provide represents the more general case of PDFA, which is not representable as a finite order Markov processs.

▲

drdeca 4 hours ago | parent | prev [-]

Huh? Any process on a computer by itself is also a Markov chain.

If you include all the information the LLM uses to produce the next token as part of the state, then of course the LLM is a Markov chain.

So would be any other process for sampling continuations of a text, with finite memory.

	▲	an hour ago \| parent [-]
		[deleted]