nodja 2 days ago

This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity and there will be minimal compute savings from batching, which, I realize now, is not really a tradeoff.

My only worry is that the diffusion objective will be worse than AR in terms of model capabilities. If that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
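Roughly, the speculative-decoding loop I have in mind looks like this (just a sketch; draft_model/target_model and their methods are placeholders, not any real API, and real implementations verify all drafted tokens in one batched forward pass of the target model):

    # Sketch only: a fast draft model (e.g. a dLLM) proposes k tokens,
    # and a slower AR target model verifies them. Method names are made up.
    def speculative_decode(draft_model, target_model, prompt, k=8, max_len=256):
        tokens = list(prompt)
        while len(tokens) < max_len:
            proposal = draft_model.propose(tokens, k)         # cheap, parallel draft
            accepted = target_model.verify(tokens, proposal)  # one forward pass
            tokens.extend(accepted)
            if len(accepted) < len(proposal):
                # On the first disagreement, take a single token from the
                # target model, then go back to drafting.
                tokens.append(target_model.sample_next(tokens))
        return tokens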

mdp2021 2 days ago | parent | next [-]

Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (idea, points, concepts, words, in a tree) that is refined iteratively; that should push in the direction of "proper" quality.
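Very roughly, the iterative refinement is something like the sketch below (masked-diffusion style decoding with a made-up model API; real models differ in the masking schedule and the confidence rule):

    # Sketch of masked-diffusion style decoding: start fully masked and
    # iteratively commit the positions the model is most confident about.
    # `model.predict_all` is a placeholder, not a real library call.
    MASK = "<mask>"

    def diffusion_decode(model, prompt, length=64, steps=8):
        seq = [MASK] * length
        per_step = max(1, length // steps)
        for _ in range(steps):
            # Predict every masked position in parallel, conditioned on the
            # prompt and everything committed so far.
            preds = model.predict_all(prompt, seq)
            masked = [i for i, t in enumerate(seq) if t == MASK]
            masked.sort(key=lambda i: -preds[i].confidence)
            for i in masked[:per_step]:
                seq[i] = preds[i].token
        return seq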

pama 2 days ago | parent | next [-]

Another intuition is simply that whenever the causal relationships in your training data are sequential, you have a lower probability of getting the correct token at a given position, because you have less of the causal information leading up to that position than you would with AR. So during training you almost always end up with a worse model, with near certainty (think of the words within a function of source code, even if the functions themselves are unsorted and only form a tree at the high level). Imagine you already have N tokens in a sequence: is it easier to predict token N+1 or token N+15? I do like the performance tradeoff for some use cases, though, and I hope we see more models soon. For image tokens my argument does not hold, because causality there is not as clear as for text, math, code, or time series.
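A toy way to see the N+1 vs N+15 point: in a simple Markov chain (transition numbers made up), the uncertainty about the token k steps past what you have observed can only grow with k, because you are marginalizing over the unseen tokens in between.

    # Toy illustration only; the transition matrix is invented. The point is
    # that entropy k steps ahead of the last observed token is non-decreasing in k.
    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    known = np.array([1.0, 0.0])                            # token N is observed exactly
    print(entropy(known @ P))                                # token N+1: ~0.47 bits
    print(entropy(known @ np.linalg.matrix_power(P, 15)))    # token N+15: ~0.92 bits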

nodja 2 days ago | parent | prev [-]

My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning gets encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through; this won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.

manmal 2 days ago | parent | prev [-]

This tradeoff will be great for self-hosted LLMs, because they usually don't need large-scale batching, and less great for cloud providers that do.