EGreg 2 days ago
This is super interesting, and obviously someone would have tried diffusion for text. But I'll ask the obvious question: how does it know how many words, or even tokens, to fill in before it knows what the words will be? That would hamstring it a lot of the time. Can it edit the words later and create more space, or is it stuck with the token positions, the way it would be with parts of an image?

It seems very strange. Words are usually composed in order, the way AR models do it, because language follows a recursive grammar, and this is especially true of computer languages. This is a bit like Mad Libs, but madder libs.

So my question is: how could this possibly give better results than AR? It would need to converge on something with the right grammar and semantics while also predicting, early on, exactly how many tokens will appear between words. Seems like there's a major impedance mismatch.
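To make the question concrete, here's a toy sketch of the decoding loop I gather masked-diffusion text models use (MaskGIT/LLaDA-style confidence-based unmasking). Everything in it is a hypothetical stand-in: toy_model is a random stub rather than a trained denoiser, and VOCAB, the confidence scores, and the schedule are made up for illustration. The point is just that the length question gets answered up front, with a fixed canvas of mask tokens:

    import random

    MASK = "<mask>"
    VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

    def toy_model(canvas):
        # Stand-in for a trained denoiser: guess a (token, confidence)
        # pair for every masked position. A real model would condition
        # on the whole canvas and output a vocabulary distribution.
        return {i: (random.choice(VOCAB), random.random())
                for i, tok in enumerate(canvas) if tok == MASK}

    def diffusion_decode(length=8, steps=4):
        # The length question is settled up front: start from a
        # fixed-size canvas of mask tokens (chosen by the user or
        # predicted separately), then refine over several passes.
        canvas = [MASK] * length
        for step in range(steps):
            guesses = toy_model(canvas)
            if not guesses:
                break
            # Commit only the most confident fraction this pass; the
            # rest stay masked and are re-predicted with more context.
            k = max(1, len(guesses) // (steps - step))
            best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:k]
            for i, (tok, _conf) in best:
                canvas[i] = tok
            print(f"step {step}: {' '.join(canvas)}")
        return canvas

    diffusion_decode()

As I understand it, some variants pad the canvas with end-of-sequence tokens so the effective length can come out shorter than the canvas, and others allow remasking already-committed positions, which is basically my "can it edit the words later" question. But within one canvas the positions are fixed, which is what makes me skeptical.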