tripplyons 2 days ago

There are definitely parallels between diffusion and reasoning models, chiefly that both can trade more compute for a better solution: a more precise ODE solver (more sampling steps) for diffusion, or more tokens for reasoning.
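
(To make the compute knob concrete, here is a toy Euler solver for the probability-flow ODE in the EDM-style parameterization; `denoiser` stands in for a trained model, and `num_steps` is the precision/compute dial: more steps means smaller solver error.)

    import torch

    def ode_sample(denoiser, shape, num_steps, sigma_max=80.0):
        # Euler solve of the probability-flow ODE
        #   dx/dsigma = (x - D(x, sigma)) / sigma
        # integrating sigma from sigma_max down to 0. More steps means
        # less truncation error, i.e. a more precise solve, for more compute.
        sigmas = torch.linspace(sigma_max, 0.0, num_steps + 1)
        x = torch.randn(shape) * sigma_max
        for i in range(num_steps):
            s, s_next = sigmas[i], sigmas[i + 1]
            d = (x - denoiser(x, s)) / s   # ODE drift at the current noise level
            x = x + (s_next - s) * d       # one Euler step toward s_next
        return x

    # Placeholder "denoiser" just to make the sketch runnable:
    x = ode_sample(lambda x, s: torch.zeros_like(x), (4, 2), num_steps=50)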

However, due to how diffusion models are trained (each denoising step is conditioned on noised ground-truth data, not on the model's own output), they never see their own predictions as input, so they cannot learn to store information across steps. Reasoning models are the complete opposite: every token they emit is fed back in as input to later steps.
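
(A minimal sketch of a standard DDPM-style training step, epsilon-prediction, with `model` as a placeholder; note that the network input x_t is always built by noising the ground-truth x0, never from the model's earlier outputs.)

    import torch

    def ddpm_loss(model, x0, alpha_bar):
        # Sample a timestep and noise the *ground-truth* x0 to get x_t.
        t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
        # The model only ever sees noised data; its own predictions from
        # other steps never appear as input, so there is no channel for it
        # to "write" information that a later step could read.
        return ((model(x_t, t) - eps) ** 2).mean()

    # Toy usage: a placeholder model and a linear noise schedule.
    alpha_bar = torch.linspace(0.999, 0.001, 1000)
    loss = ddpm_loss(lambda x, t: torch.zeros_like(x), torch.randn(8, 16), alpha_bar)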

macawfish 2 days ago | parent | next [-]

I'm probably not understanding your point, but did you look at the paper? It explicitly does diffusion in an autoencoded latent space of the autoregressive prediction itself. The starting point is that prediction, but depending on how much noise is used, the diffusion model itself directly contributes to the prediction process to one degree or another.

It should be trivial to make an encoder that has some memory of at least part of the prompt (say the trailing part) and do a diffusion step there too.
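
(Roughly the shape of the setup as described above; every name here is hypothetical, not the paper's actual API.)

    import torch

    def refine(ar_model, encoder, denoiser, decoder, prompt, noise_level):
        draft = ar_model(prompt)                  # autoregressive prediction
        z = encoder(draft)                        # latent of that prediction
        z_noisy = z + noise_level * torch.randn_like(z)   # partial noising
        # Low noise_level: mostly the AR draft survives. High noise_level:
        # the diffusion model contributes more of the final prediction.
        z_denoised = denoiser(z_noisy, noise_level)
        return decoder(z_denoised)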

yorwba 2 days ago | parent | prev [-]

You can train a diffusion model using its own predictions as input, no problem at all.

tripplyons 2 days ago | parent [-]

At that point it is not following a diffusion training objective. I am aware of papers that do this, but I have not seen one that shows it as a better pretraining objective than something like v-prediction or flow matching.
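
(For reference, sketches of the two baselines mentioned, using the standard definitions: v-prediction targets v = sqrt(a)*eps - sqrt(1-a)*x0, while flow matching along the linear path x_t = (1-t)*x0 + t*eps targets the velocity eps - x0. `model` is a placeholder.)

    import torch

    def v_prediction_loss(model, x0, alpha_bar):
        t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
        v = a.sqrt() * eps - (1 - a).sqrt() * x0   # v-prediction target
        return ((model(x_t, t) - v) ** 2).mean()

    def flow_matching_loss(model, x0):
        t = torch.rand(x0.shape[0]).view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        x_t = (1 - t) * x0 + t * eps               # linear interpolation path
        return ((model(x_t, t) - (eps - x0)) ** 2).mean()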

mxwsn 2 days ago | parent [-]

Why is that not the diffusion training objective? The technique is known as self-conditioning, right? Or is it an issue with the conditional form of Tweedie's formula?
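
(For what it's worth, a sketch of self-conditioning as described in Chen et al.'s "Analog Bits" paper, with placeholder names: half the time the model first produces an x0 estimate with the conditioning slot zeroed out, and that stop-gradient estimate is fed back as an extra input for the pass the loss is computed on. The noising of x_t itself is unchanged.)

    import torch

    def self_conditioned_loss(model, x0, alpha_bar):
        t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # still noised ground truth
        self_cond = torch.zeros_like(x0)             # empty conditioning slot
        if torch.rand(()) < 0.5:
            with torch.no_grad():                    # first pass, no gradient
                self_cond = model(x_t, t, self_cond)
        # Second pass conditions on the model's own (stopped-gradient)
        # prediction, so the network does learn to consume its own output.
        x0_hat = model(x_t, t, self_cond)
        return ((x0_hat - x0) ** 2).mean()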