orbital-decay 2 days ago
Token prediction is just one way to look at autoregressive models. There's plenty of evidence that they internally represent the entire reply at each step, albeit with limited precision, and use it to progressively reveal the rest. Diffusion is similar (in fact, it's built around this process from the start), but it runs in the crude-to-detailed direction rather than start to end. I guess diffusion might lose less precision on longer generations, but you still don't get full insight into the answer until you've actually generated it.
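To make the two reveal orders concrete, here is a toy sketch. It is not a real model: the "answer" is a fixed string known in advance, and the hypothetical helpers `autoregressive_reveal` and `diffusion_style_reveal` only imitate the order in which positions become visible (strictly left to right versus scattered positions filled in over a fixed number of steps, loosely in the spirit of masked discrete diffusion). It says nothing about what either kind of model represents internally.

```python
import random


def autoregressive_reveal(target):
    """Yield partial outputs revealed strictly left to right, one position per step."""
    for i in range(1, len(target) + 1):
        yield target[:i] + "_" * (len(target) - i)


def diffusion_style_reveal(target, steps, seed=0):
    """Yield partial outputs where positions across the whole sequence are
    filled in over a fixed number of steps (toy stand-in for unmasking)."""
    rng = random.Random(seed)
    order = list(range(len(target)))
    rng.shuffle(order)  # assumed reveal order; a real model would choose this itself
    per_step = -(-len(target) // steps)  # ceiling division
    revealed = set()
    for s in range(steps):
        revealed.update(order[s * per_step:(s + 1) * per_step])
        yield "".join(c if i in revealed else "_" for i, c in enumerate(target))


if __name__ == "__main__":
    target = "hello world"
    print("left to right:")
    for partial in autoregressive_reveal(target):
        print(" ", partial)
    print("scattered refinement:")
    for partial in diffusion_style_reveal(target, steps=4):
        print(" ", partial)
```

In both cases only the final step yields the complete string, which is the point above: whichever direction the process runs, you don't see the full answer until generation has finished.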