DoctorOetker 5 hours ago

> Mercury 2 doesn't decode sequentially. It generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. Less typewriter, more editor revising a full draft at once.

There has been quite a lot of progress unifying DDPMs and SGMs under a single SDE framework:

> DDPM and Score-Based Models: The objective function of DDPMs (maximizing the ELBO) is equivalent to the score matching objectives used to train SGMs.
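Spelled out (a sketch in standard notation, following the usual denoising score matching formulation): the DDPM ELBO reduces, up to a per-timestep weighting \(\lambda(t)\), to

```latex
\mathcal{L} = \mathbb{E}_{t,\, x_0,\, x_t}\!\left[ \lambda(t)\,
  \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2 \right]
```

i.e. the network \(s_\theta\) is regressed onto the (conditional) score, which is exactly the SGM training objective.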

> SDE-based Formulation: Both DDPMs and SGMs can be unified under a single SDE framework, where the forward diffusion is an Ito SDE and the reverse process uses score functions to recover data.
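Concretely (a sketch using the standard notation of the SDE framework): the forward noising process is an Itô SDE, and the reverse-time SDE that recovers data depends on the score of the marginals:

```latex
% Forward (noising) process
dx = f(x, t)\,dt + g(t)\,dW_t

% Reverse-time SDE, driven by the score \nabla_x \log p_t(x)
dx = \left[\, f(x, t) - g(t)^2 \,\nabla_x \log p_t(x) \,\right] dt + g(t)\,d\bar{W}_t
```

DDPMs and SGMs correspond to particular choices of the drift \(f\) and diffusion coefficient \(g\).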

> Flow Matching (Continuous-Time): Flow matching is equivalent to diffusion models when the source distribution corresponds to a Gaussian. Flow matching offers "straight" trajectories compared to the often curved paths of diffusion, but they share similar training objectives and weightings.
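The "straight trajectories" point can be made concrete in a toy NumPy sketch (all names are illustrative, not from any library; the schedule below is just one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# A source sample (Gaussian noise) and a stand-in for a data sample.
x0 = rng.standard_normal(8)         # source ~ N(0, I)
x1 = rng.standard_normal(8) + 3.0   # "data"

# Flow matching interpolates on a straight line between source and data...
t = rng.uniform()
x_t = (1.0 - t) * x0 + t * x1

# ...so the velocity-field regression target is constant along the path:
v_target = x1 - x0

# A VP-diffusion interpolant, by contrast, follows a curved path:
alpha_t = np.cos(0.5 * np.pi * t)   # one common noise schedule (illustrative)
sigma_t = np.sin(0.5 * np.pi * t)
x_t_diffusion = alpha_t * x1 + sigma_t * x0

# Training would minimize ||model(x_t, t) - v_target||^2 over (x0, x1, t);
# here a dummy zero "model" just shows the loss shape.
loss = np.mean((np.zeros_like(v_target) - v_target) ** 2)
```

Note that `x_t = x0 + t * v_target`: the sample moves along the constant target velocity, which is what "straight" means here.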

Is there a similar connection between modern transformers and diffusion?

Suppose we look at each layer (or the residual connection between layers) acting on the context window of tokens (typically a power of 2). What gets incrementally added to the embedding vectors is a function of the previous layer's outputs. If we have L layers, what is the connection between those L "steps" of a transformer and performing L denoising refinements of a diffusion model?
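The structural analogy can be written out as a toy sketch (purely illustrative; `transformer_block` stands in for attention + MLP, `denoise_step` for a learned refinement, neither is a real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 16  # number of layers / refinement steps, embedding dimension

# Transformer view: the residual stream accumulates per-layer updates.
def transformer_block(h):
    # stand-in for one layer's attention + MLP (a fixed random map here)
    W = rng.standard_normal((d, d)) * 0.1
    return np.tanh(h @ W)

h = rng.standard_normal(d)            # initial token embedding
for l in range(L):
    h = h + transformer_block(h)      # h_{l+1} = h_l + f_l(h_l)

# Diffusion view: the sample accumulates per-step denoising updates.
def denoise_step(x):
    # stand-in for a learned score/denoiser (toy: drift toward the origin)
    return -0.1 * x

x = rng.standard_normal(d)            # start from noise
for k in range(L):
    x = x + denoise_step(x)           # x_{k+1} = x_k + step_k(x_k)

# Both are L additive refinements of a d-dimensional state; the open question
# is whether one parameterization can be fit to imitate the other.
```

In both loops the state is updated additively L times, which is the sense in which a deep residual network already looks like a discretized refinement process.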

Does this allow fitting a diffusion model to a transformer and vice versa?