theptip 2 days ago
Can someone help with the intuition here? My understanding from vision transformers is that you start with noise and use a series of hierarchical models to iteratively refine the noise into the target. Each layer is trained to produce images at an increasing resolution, and by layering them you skip the problem of sparse gradients at the beginning to get from “noise” to “noise that kinda looks like a face”.

How does this work for coding? It would require you to be able to hierarchically structure the emitted artifacts. Maybe this sort of works: low-granularity concepts like “use Django for this problem”, then “I need these endpoints”, then “emit the code”. But AIUI diffusion doesn’t have a mechanism for backtracking, so you can’t feed signals back from the detailed layers to the “higher abstraction” layers at the top if you need to change an aspect of the design in response to a low-level problem. Whereas with transformers, you go through the whole model for each token and can therefore deploy all your smarts and logic at each step of the problem (if needed), including backtracking on key design decisions.

I’m sure my mental model has some big gaps; I would appreciate any insights.
nvtop 2 days ago | parent | next
Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and good old masked language modeling. Recall how BERT is trained:

1. Take a full sentence ("the cat sat on the mat").
2. Replace 15% of tokens with a [MASK] token ("the cat [MASK] on [MASK] mat").
3. Make the Transformer predict the tokens at the masked positions. It does this in parallel, in a single inference step.

Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop there? Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.

Once you've trained that, to generate something from scratch you start by feeding the model all [MASK]s. It will generate mostly gibberish, but you can take some tokens (say, 10%) at random positions and treat them as generated ("final"). Next, you run another iteration of inference, this time with the input having 90% masks and 10% "final" tokens. Again, you mark 10% of the new tokens as final. Continue, and in 10 steps you'll have generated a whole sequence. This is the core idea behind diffusion language models.

Of course, there are some optimizations in the real world. If you need to generate a really long text (over 200 tokens), you'd better split it into chunks and fully generate the first chunk in parallel before moving to the next one. This semi-autoregressive generation is what Block Diffusion does. You can also be smart about exactly how you pick the tokens you consider generated, and what percentage: at earlier stages, when it's mostly noise, you can take more, and at the final stages you can do more iterations and take fewer tokens.

All in all, diffusion LMs are still iterative, but the number of steps is much lower than in autoregressive models. A nice thing is that you can choose how many steps to take, trading quality for speed. In the extreme, you can even generate just the one leftmost masked token with a diffusion LM, effectively turning it into a traditional causal language model.
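If it helps to see the unmasking loop spelled out, here is a minimal NumPy sketch of the idea described above. It is purely illustrative: `predict_logits` is a stand-in for a trained BERT-style denoising Transformer, and the 10%-per-step schedule and "keep the most confident predictions" rule are just one possible choice.

```python
import numpy as np

# Toy vocabulary; [MASK] is the "noise" token.
VOCAB = ["[MASK]", "the", "cat", "sat", "on", "mat", "dog", "ran"]
MASK_ID = 0

def predict_logits(tokens):
    """Stand-in for a trained masked-denoising Transformer.

    A real model would attend over the whole partially masked sequence and
    return a distribution over the vocabulary for every position. Here we
    return random logits so the sampling loop below runs end to end.
    """
    rng = np.random.default_rng()
    return rng.normal(size=(len(tokens), len(VOCAB)))

def diffusion_generate(seq_len=6, num_steps=10):
    # Start from pure "noise": every position is [MASK].
    tokens = np.full(seq_len, MASK_ID)
    finalized = np.zeros(seq_len, dtype=bool)

    for _ in range(num_steps):
        logits = predict_logits(tokens)
        logits[:, MASK_ID] = -np.inf            # never "predict" the mask token
        preds = logits.argmax(axis=-1)          # model's best guess at every position
        confidence = logits.max(axis=-1)        # how sure it is per position

        remaining = np.flatnonzero(~finalized)
        if remaining.size == 0:
            break

        # Finalize a fraction of the still-masked positions each step,
        # keeping the most confident predictions (random choice also works).
        k = max(1, int(np.ceil(seq_len / num_steps)))
        chosen = remaining[np.argsort(-confidence[remaining])[:k]]

        tokens[chosen] = preds[chosen]
        finalized[chosen] = True

    return [VOCAB[t] for t in tokens]

print(diffusion_generate())
```

With a random stand-in model the output is gibberish, but the structure is the point: each step re-runs the model on the partially unmasked sequence and fixes a few more tokens, so fewer, cheaper passes replace one pass per token.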
yorwba 2 days ago | parent | prev | next
You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome; downscaling is just a performance optimization, because processing an image at full resolution is expensive.
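For concreteness, "downscale by averaging embeddings" would look something like the sketch below. This is a hypothetical illustration of the analogy, not something these models are claimed to do; the function name and shapes are made up.

```python
import numpy as np

def downscale_embeddings(token_embeddings, factor=2):
    """Average-pool a sequence of token embeddings, analogous to
    downscaling an image by averaging neighboring pixels.

    token_embeddings: (seq_len, dim) array; seq_len is assumed to be
    a multiple of `factor` for simplicity.
    """
    seq_len, dim = token_embeddings.shape
    pooled = token_embeddings.reshape(seq_len // factor, factor, dim)
    return pooled.mean(axis=1)  # shape: (seq_len // factor, dim)

# Eight "tokens" with 4-d embeddings -> four coarse embeddings.
emb = np.random.default_rng(0).normal(size=(8, 4))
print(downscale_embeddings(emb).shape)  # (4, 4)
```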
pertymcpert 2 days ago | parent | prev
I have the exact same questions as you. I can barely understand how diffusion works for images; for sequential data like text it makes no sense to me.