| ▲ | mountainriver 2 days ago |
| Diffusion is more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, than AR. This is because it can edit its output and doesn't suffer from early token bias. |
|
| ▲ | martincsweiss 2 days ago | parent | next [-] |
| This is a super interesting claim - can you point to these benchmarks? |
| |
| ▲ | cubefox 2 days ago | parent | next [-] |
https://deepmind.google/models/gemini-diffusion/#benchmarks

> Gemini Diffusion's external benchmark performance is comparable to much larger models, whilst also being faster.

That doesn't necessarily mean that they scale as well as autoregressive models.
| ▲ | jimmyl02 2 days ago | parent [-] |
I think there is no way to tell, and we can only see with more research and time. One nuanced part that might not be clear is that the transformer was a huge part of what made traditional LLMs scale. With the diffusion transformer and newer architectures, it may be that transformers can now be applied to diffusion as well.

Diffusion also has the benefit of being able to "think" by varying the number of diffusion steps, instead of having to output tokens and then reason about them.

I think it's hard to tell exactly where we are headed, but it's an interesting research direction, especially now that it's somewhat more validated by Google.
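The "thinking with diffusion steps" idea above can be sketched as iterative unmasking. This is a toy stand-in, not Gemini Diffusion's actual algorithm: the `scores` table plays the role of a real denoiser's per-position predictions, and the step count, rather than output length, controls how much compute is spent.

```python
# Toy sketch of masked-token diffusion decoding (all names hypothetical):
# start from an all-masked sequence and reveal the most confident position
# each denoising step, so compute scales with `steps`, not with tokens emitted.
MASK = "_"

def denoise_step(seq, scores):
    """Fill in the single masked position the (stub) model is most confident about."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    best = max(masked, key=lambda i: scores[i][1])  # highest-confidence masked slot
    seq = list(seq)
    seq[best] = scores[best][0]
    return seq

def generate(length, scores, steps):
    seq = [MASK] * length
    for _ in range(min(steps, length)):
        seq = denoise_step(seq, scores)
    return seq

# Stub "model output": (token, confidence) per position. A real denoiser would
# re-score the whole sequence each step and could also revise already-set tokens.
scores = [("the", 0.9), ("cat", 0.7), ("sat", 0.8)]
print(generate(3, scores, steps=3))  # ['the', 'cat', 'sat']
```

With fewer steps than positions, the sequence is only partially denoised, which is the knob the comment is pointing at: more steps, more refinement passes over the same output.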
| |
| ▲ | mdp2021 2 days ago | parent | prev | next [-] |
Try this one:

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

https://dllm-reasoning.github.io/
| ▲ | mountainriver 2 days ago | parent | prev [-] |
https://github.com/HKUNLP/diffusion-vs-ar
|
|
| ▲ | hansvm 2 days ago | parent | prev | next [-] |
| AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution. |
| |
| ▲ | mdp2021 2 days ago | parent [-] |
> AR in general is critical for learning the right distribution

Could you please clarify that?
| ▲ | hansvm 2 days ago | parent [-] |
Assuming your goal is mimicking the training data, you need some mechanism for drawing from the same distribution. AR happens to provide that: it's a particular factorization of conditional probabilities which yields the same distribution you started with, and it's one you're able to replicate from your training data.

AR is not the only possible solution, but many other techniques floating around do not have that property of actually learning the right thing.

Moreover, since the proposed limitation (not being able to think a long time about your response before continuing) is a byproduct of current architectures rather than a fundamental flaw with AR, it's not as obvious as it might seem that you'd want to axe the technique.
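The factorization being referred to is the chain rule of probability: p(x1, x2) = p(x1) * p(x2 | x1). A minimal numeric check (an illustration, not from the thread) shows that sampling tokens left to right from the conditionals reproduces exactly the joint distribution you started with:

```python
# Toy illustration of the AR chain-rule factorization: building p(x1) * p(x2|x1)
# from an arbitrary joint over two binary tokens recovers that joint exactly.
from itertools import product

# An arbitrary joint distribution over pairs of binary tokens.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

# Marginal of the first token: p(x1).
p1 = {a: sum(p for (x1, _), p in joint.items() if x1 == a) for a in (0, 1)}

# Conditional of the second token given the first: p(x2 | x1).
cond = {(a, b): joint[(a, b)] / p1[a] for a, b in product((0, 1), repeat=2)}

# Autoregressive reconstruction: p(x1) * p(x2 | x1).
reconstructed = {(a, b): p1[a] * cond[(a, b)] for a, b in product((0, 1), repeat=2)}

for pair in joint:
    assert abs(joint[pair] - reconstructed[pair]) < 1e-12  # matches exactly
```

This is the sense in which AR "yields the same distribution you started with": the factorization is exact, so a model that learns each conditional well learns the joint.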
|
|
|
| ▲ | vessenes 2 days ago | parent | prev [-] |
| A claim I believe (or want to), but can you point to any papers about this? I haven't seen any papers, or even demos, showing a revision step for diffusion text. I'd really like to use one though. |