nvtop 2 days ago
Correct, diffusion LMs can edit their intermediate predictions, so "final" tokens aren't necessarily final. This is an exciting property because it lets the model correct errors in what it has generated so far, something GPT-like models can't do.

This editing relies on the fact that a Transformer encoder predicts token probabilities for __every__ token in the sequence, not just for the [MASK]s. So when you feed in a three-token sentence `[MASK] cat barks`, the Transformer produces a probability distribution over the vocabulary for each of the three positions, for free. You can then choose among many ways of deciding whether to edit a token or keep it as is. In the simplest case, take the new token if its probability exceeds the original's by some margin. In our example, say the model returns the probability of the token "cat" at the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use it in the subsequent iterations. The actual heuristics are slightly more involved, but that's the basic idea.

P.S. To teach the LM not to simply copy the unmasked input tokens but to look for a better replacement, the training objective should include replacing some % of input tokens with other random tokens. Now part of the input is masked and part of it is corrupted, so the model can't blindly assume that all input tokens are there to stay.
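(A minimal sketch of that margin heuristic, not the actual decoding code; `model`, `margin`, and the tensor shapes are assumptions, with `model` returning per-position logits of shape [seq_len, vocab_size] for the whole sequence.)

```python
import torch

def edit_tokens(model, token_ids, margin=0.2):
    """One refinement step: re-predict every position and swap in a new token
    when the model prefers it over the current one by at least `margin`."""
    with torch.no_grad():
        probs = torch.softmax(model(token_ids), dim=-1)            # [seq_len, vocab]
    # probability the model assigns to the tokens currently in the sequence
    current_p = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    best_p, best_ids = probs.max(dim=-1)                            # model's preferred tokens
    # e.g. p_2("dog") = 0.6 vs p_2("cat") = 0.3: replace "cat" with "dog"
    replace = best_p > current_p + margin
    return torch.where(replace, best_ids, token_ids)
```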
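(And a rough sketch of the training-time corruption from the P.S.; the fractions and names are illustrative, not any paper's actual recipe.)

```python
import torch

def corrupt(token_ids, mask_id, vocab_size, mask_frac=0.15, random_frac=0.10):
    """Mask some tokens and replace others with random tokens, so the model
    can't assume that unmasked input tokens are correct."""
    labels = token_ids.clone()                      # model is trained to recover these
    noisy = token_ids.clone()
    r = torch.rand_like(token_ids, dtype=torch.float)
    noisy[r < mask_frac] = mask_id                  # masked positions
    corrupt_pos = (r >= mask_frac) & (r < mask_frac + random_frac)
    noisy[corrupt_pos] = torch.randint(vocab_size, (int(corrupt_pos.sum()),))
    return noisy, labels
```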
paulsmith a day ago
> say the model returns the probability of the token "cat" at the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use it in the subsequent iterations.

Might one speed/quality tradeoff be a tree "search" for better outcomes by branching on logit choices? If a diffusion model is so much faster overall than an AR model, I might not mind hunting or backtracking for the best overall probabilities.
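(Something like this hypothetical single-depth branching step; `model` and the scoring rule are assumptions, and a real search would branch deeper and keep several beams.)

```python
import torch

def branch_and_rescore(model, token_ids, position, top_k=3):
    """Fork the sequence on the top-k candidates at one position,
    then keep the branch the model scores highest overall."""
    with torch.no_grad():
        probs = torch.softmax(model(token_ids), dim=-1)
        candidates = probs[position].topk(top_k).indices
        best_score, best_branch = float("-inf"), token_ids
        for cand in candidates:
            branch = token_ids.clone()
            branch[position] = cand
            # score a branch by the model's mean log-probability of its own tokens
            p = torch.softmax(model(branch), dim=-1)
            score = p.gather(-1, branch.unsqueeze(-1)).log().mean().item()
            if score > best_score:
                best_score, best_branch = score, branch
    return best_branch
```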
skydhash 2 days ago
But what about the dependency graph between symbols in a program? All of those symbols have strong constraints around them, and those constraints are the program's design. The same issue comes up in image diffusion: when you ask for a portrait, some details come out wrong because a face has constraints (which you learn about as an artist). Patterns and probability won't help you there.