crystal_revenge 2 days ago

> so the entire input is processed all at once

I’m a bit confused by this statement. Autoregressive LLMs also process the entire input “at once”; otherwise tricks like speculative decoding wouldn’t work. Can you clarify what you mean by this?
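
A minimal sketch of the parallel verification step that speculative decoding relies on (toy random weights stand in for a real model here; only the shape of the computation matters). One forward pass scores every draft position simultaneously:

    import torch

    vocab, k, d = 100, 8, 32
    torch.manual_seed(0)
    emb = torch.randn(vocab, d)  # toy embedding table, doubles as output head

    def model(ids):  # ONE forward pass over the whole sequence
        h = emb[ids]                                        # (k, d)
        causal = torch.tril(torch.ones(len(ids), len(ids)))
        scores = (h @ h.T).masked_fill(causal == 0, -1e9)
        return scores.softmax(-1) @ h @ emb.T               # (k, vocab) logits

    draft = torch.randint(0, vocab, (k,))         # k tokens from a cheap draft model
    logits = model(draft)                         # verified in a single pass --
    accept = logits[:-1].argmax(-1) == draft[1:]  # -- possible because all input
    print(accept)                                 # tokens are processed together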

mattnewton 2 days ago

Diffusion language models typically look like encoders: tokens earlier in the sequence can “see” tokens later in the sequence, attending to their values. Noise is iteratively removed from the entire buffer at once, over a handful of steps.

Versus autoregressive models, which take one step per token and only attend to previous tokens.
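
A toy sketch of the difference in attention masks (sizes are illustrative, not from any particular model):

    import torch

    seq = 5
    causal = torch.tril(torch.ones(seq, seq))  # autoregressive: token i sees tokens <= i
    full = torch.ones(seq, seq)                # diffusion denoiser: token i sees everything

    print(causal)  # lower-triangular: no peeking at later positions
    print(full)    # all ones: earlier tokens attend to later ones too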

cma 2 days ago

Each token in an LLM only looks backwards: they are forward-only, causal transformers. Once a token’s representation is in the KV cache it is never recomputed; it only gets reinterpreted indirectly, when later tokens at higher layers merge previous values through softmax attention and have to compensate if a lower layer built the wrong representation given context that arrived afterwards.
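
A minimal sketch of that append-only cache during decoding (a single toy attention layer; cached keys and values are never rewritten once stored):

    import torch

    d = 16
    k_cache, v_cache = [], []  # grow every step; existing entries are frozen

    def decode_step(x):  # x: (d,) hidden state for the newest token
        k_cache.append(x)
        v_cache.append(x)
        K, V = torch.stack(k_cache), torch.stack(v_cache)
        return (x @ K.T).softmax(-1) @ V  # new token attends to past tokens only

    for _ in range(5):
        out = decode_step(torch.randn(d))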

Transformers can also run in a bidirectional mode, like BERT etc., and then get a much richer representation, but generating from them is much more expensive.
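
Back-of-the-envelope accounting of why (counting attention scores only, single layer; the bookkeeping is made up but the asymptotics hold):

    # attention scores computed while generating n tokens, per strategy
    def causal_with_kv_cache(n):
        # step t: 1 new query against t cached keys
        return sum(t for t in range(1, n + 1))      # ~ n^2 / 2

    def bidirectional_reencode(n):
        # no usable cache: step t re-encodes all t tokens against each other
        return sum(t * t for t in range(1, n + 1))  # ~ n^3 / 3

    for n in (128, 1024):
        print(n, causal_with_kv_cache(n), bidirectional_reencode(n))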

Once we hit a data wall, bidirectional training could potentially give raw models something like a GPT-3 -> GPT-4 uplift on the same amount of data, for far more compute. There are hybrid approaches that stay causal but augment it bidirectionally, occasionally reprocessing past context with a bidirectional pass; but it is looking like diffusion approaches may work better instead, and I think those have their transformer operating bidirectionally.

Lots of written text, especially things like math textbooks, is written more like a graph of concepts that only all click into place once every concept has been processed (think of a math teacher saying “we need to introduce and use this now, but can only explain the finer details or the proof later”).

I think bidirectional models can handle that a lot better for the same number of parameters and data, but they were considered intractable for generation. Still, generating from them over longer and longer sequences wouldn’t have outpaced Moore’s law or anything, so it could have been the approach if the data wall had been really harsh and we hadn’t found other ways of extending it in the meantime.

orbital-decay 2 days ago

Token prediction is just one way to look at autoregressive models. There’s plenty of evidence that they internally express the entire reply at each step, although with limited precision, and use it to progressively reveal the rest. Diffusion is similar (in fact it’s built around this process from the start), but it runs in the crude-to-detailed direction, not start-to-end. I’d guess diffusion might lose less precision on long generations, but you still don’t get full insight into the answer until you have actually generated it.
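
Crude-to-detailed in the masked-diffusion sense looks roughly like this confidence-ordered unmasking loop (random logits stand in for a trained denoiser; no particular model’s scheduler is implied):

    import torch

    vocab, seq = 50, 12
    MASK = vocab  # mask id sits outside the normal vocab
    torch.manual_seed(0)

    def denoiser(ids):
        # stand-in for a bidirectional model: (seq, vocab) logits per call
        return torch.randn(len(ids), vocab)

    ids = torch.full((seq,), MASK)  # start fully "crude": everything masked
    while (ids == MASK).any():
        conf, guess = denoiser(ids).softmax(-1).max(-1)
        conf[ids != MASK] = -1.0  # only still-masked slots compete
        pos = conf.argmax()       # commit the most confident token --
        ids[pos] = guess[pos]     # -- anywhere in the sequence,
    print(ids)                    # not left-to-right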