DoctorOetker 5 hours ago

> Mercury 2 doesn't decode sequentially. It generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. Less typewriter, more editor revising a full draft at once.

There has been quite a lot of progress unifying DDPMs and SGMs under a single SDE framework:

> DDPM and Score-Based Models: The objective function of DDPMs (maximizing the ELBO) is equivalent to the score matching objectives used to train SGMs.
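Spelled out (a sketch in standard notation, following the usual denoising score matching formulation): the DDPM ELBO reduces, up to a per-timestep weighting \(\lambda(t)\), to

```latex
\mathcal{L} = \mathbb{E}_{t,\, x_0,\, x_t}\!\left[ \lambda(t)\,
  \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2 \right]
```

i.e. the network \(s_\theta\) is regressed onto the (conditional) score, which is exactly the SGM training objective.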

> SDE-based Formulation: Both DDPMs and SGMs can be unified under a single SDE framework, where the forward diffusion is an Ito SDE and the reverse process uses score functions to recover data.
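Concretely (a sketch using the standard notation of the SDE framework): the forward noising process is an Itô SDE, and the reverse-time SDE that recovers data depends on the score of the marginals:

```latex
% Forward (noising) process
dx = f(x, t)\,dt + g(t)\,dW_t

% Reverse-time SDE, driven by the score \nabla_x \log p_t(x)
dx = \left[\, f(x, t) - g(t)^2 \,\nabla_x \log p_t(x) \,\right] dt + g(t)\,d\bar{W}_t
```

DDPMs and SGMs correspond to particular choices of the drift \(f\) and diffusion coefficient \(g\).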

> Flow Matching (Continuous-Time): Flow matching is equivalent to diffusion models when the source distribution corresponds to a Gaussian. Flow matching offers "straight" trajectories compared to the often curved paths of diffusion, but they share similar training objectives and weightings.
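The "straight trajectories" point can be made concrete in a toy NumPy sketch (all names are illustrative, not from any library; the schedule below is just one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# A source sample (Gaussian noise) and a stand-in for a data sample.
x0 = rng.standard_normal(8)         # source ~ N(0, I)
x1 = rng.standard_normal(8) + 3.0   # "data"

# Flow matching interpolates on a straight line between source and data...
t = rng.uniform()
x_t = (1.0 - t) * x0 + t * x1

# ...so the velocity-field regression target is constant along the path:
v_target = x1 - x0

# A VP-diffusion interpolant, by contrast, follows a curved path:
alpha_t = np.cos(0.5 * np.pi * t)   # one common noise schedule (illustrative)
sigma_t = np.sin(0.5 * np.pi * t)
x_t_diffusion = alpha_t * x1 + sigma_t * x0

# Training would minimize ||model(x_t, t) - v_target||^2 over (x0, x1, t);
# here a dummy zero "model" just shows the loss shape.
loss = np.mean((np.zeros_like(v_target) - v_target) ** 2)
```

Note that `x_t = x0 + t * v_target`: the sample moves along the constant target velocity, which is what "straight" means here.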

Is there a similar connection between modern transformers and diffusion?

Suppose we look at each layer (or the residual connection between layers) acting on the context window of tokens (typically a power of 2). What gets incrementally added to the embedding vectors is a function of the previous layer's outputs. If we have L layers, what is the connection between those L "steps" of a transformer and performing L denoising refinements of a diffusion model?
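The structural analogy can be written out as a toy sketch (purely illustrative; `transformer_block` stands in for attention + MLP, `denoise_step` for a learned refinement, neither is a real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 16  # number of layers / refinement steps, embedding dimension

# Transformer view: the residual stream accumulates per-layer updates.
def transformer_block(h):
    # stand-in for one layer's attention + MLP (a fixed random map here)
    W = rng.standard_normal((d, d)) * 0.1
    return np.tanh(h @ W)

h = rng.standard_normal(d)            # initial token embedding
for l in range(L):
    h = h + transformer_block(h)      # h_{l+1} = h_l + f_l(h_l)

# Diffusion view: the sample accumulates per-step denoising updates.
def denoise_step(x):
    # stand-in for a learned score/denoiser (toy: drift toward the origin)
    return -0.1 * x

x = rng.standard_normal(d)            # start from noise
for k in range(L):
    x = x + denoise_step(x)           # x_{k+1} = x_k + step_k(x_k)

# Both are L additive refinements of a d-dimensional state; the open question
# is whether one parameterization can be fit to imitate the other.
```

In both loops the state is updated additively L times, which is the sense in which a deep residual network already looks like a discretized refinement process.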

Does this allow fitting a diffusion model to a transformer and vice versa?