shreezus 2 days ago

Is anyone else totally blown away by this? I feel like it's easily the biggest announcement out of I/O; however, it's been overshadowed by Veo 3 etc.

Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.

theptip 2 days ago | parent | next [-]

Can someone help with the intuition here? My understanding from vision transformers is that you start with noise and use a series of hierarchical models to iteratively refine the noise into the target. Each layer is trained to produce images at an increasing resolution, and by layering them you skip the problem of sparse gradients at the beginning to get from "noise" to "noise that kinda looks like a face".

How does this work for coding? It would require you to be able to hierarchically structure the emitted artifacts. Maybe this sort of works: low-granularity concepts like "use Django for this problem", then "I need these endpoints", then "emit the code". But AIUI diffusion doesn't have a mechanism for backtracking, so you can't feed back signals from the detailed layers to the "higher abstraction" layers at the top if you need to change an aspect of the design in response to a low-level problem.

Whereas with autoregressive transformers, you go through the whole model for each token and can therefore deploy all your smarts and logic at each step of the problem (if needed), including backtracking on key design decisions.

I’m sure my mental model has some big gaps, would appreciate any insights.

nvtop 2 days ago | parent | next [-]

Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and good old masked language modeling. Recall how BERT is trained:

1. Take a full sentence ("the cat sat on the mat")
2. Replace 15% of the tokens with a [MASK] token ("the cat [MASK] on [MASK] mat")
3. Make the Transformer predict the tokens at the masked positions. It does this in parallel, in a single inference step.

Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop there? Let's train a model to recover texts with 30%, 50%, 90%, even 100% of their tokens masked.

Once you've trained that, you can generate something from scratch: start by feeding the model all [MASK]s. It will produce mostly gibberish, but you can take some of the tokens (say, 10%) at random positions and treat them as generated ("final"). Next, you run another inference iteration, this time with an input that is 90% masks and 10% "final" tokens. Again, you mark 10% of the new tokens as final. Continue, and in 10 steps you'll have generated the whole sequence. This is the core idea behind diffusion language models.
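
A minimal sketch of that demasking loop, assuming a `model` callable that returns per-position logits over the vocabulary (every name here is illustrative, not how Gemini Diffusion is actually implemented):

    import torch

    @torch.no_grad()
    def diffusion_generate(model, mask_id, length=32, steps=10):
        # Start from a sequence that is entirely [MASK] tokens.
        ids = torch.full((1, length), mask_id, dtype=torch.long)
        finalized = torch.zeros(1, length, dtype=torch.bool)

        for step in range(steps):
            logits = model(ids)        # (1, length, vocab): a prediction for every position
            pred = logits.argmax(-1)   # greedy choice per position

            # Freeze ("finalize") a few of the still-masked positions.
            # Random positions here, as described above; real samplers often
            # prefer the most confident predictions instead.
            remaining = (~finalized).nonzero(as_tuple=True)[1]
            k = max(1, remaining.numel() // (steps - step))
            chosen = remaining[torch.randperm(remaining.numel())[:k]]
            finalized[0, chosen] = True

            # Keep finalized tokens, re-mask everything else for the next pass.
            ids = torch.where(finalized, pred, torch.full_like(ids, mask_id))

        return ids[0]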

Of course, there are some optimizations in the real world. If you need to generate a really long text (over ~200 tokens), it's better to split it into chunks and fully generate the first chunk in parallel before moving on to the next one. This semi-autoregressive generation is what Block Diffusion does.
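
A rough sketch of that semi-autoregressive, block-by-block variant (same caveats: `model` and all other names are assumptions, not the actual Block Diffusion code):

    import torch

    @torch.no_grad()
    def generate_in_blocks(model, mask_id, total_len=512, block_len=128, steps_per_block=8):
        # Fully demask one block before moving on to the next, left to right.
        ids = torch.full((1, total_len), mask_id, dtype=torch.long)
        for start in range(0, total_len, block_len):
            end = start + block_len
            for step in range(steps_per_block):
                pred = model(ids[:, :end]).argmax(-1)   # condition on finished blocks + current block
                masked = (ids[0, start:end] == mask_id).nonzero(as_tuple=True)[0] + start
                k = max(1, masked.numel() // (steps_per_block - step))
                chosen = masked[torch.randperm(masked.numel())[:k]]
                ids[0, chosen] = pred[0, chosen]        # finalize a few tokens of this block
        return ids[0]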

You can also be smart about exactly which tokens you treat as generated and what percentage you finalize per step. At earlier stages, when it's mostly noise, you can take more; at the final stages you can do more iterations and take fewer tokens.

All in all, diffusion LMs are still iterative, but the number of steps is much lower than in autoregressive models. A nice property is that you can choose how many steps you take, trading quality for speed.

In the extreme, you can even generate just one leftmost masked token with a diffusion LM, effectively turning it into a traditional causal language model.

yahoozoo 2 days ago | parent | next [-]

Great explanation. I think I've seen that text diffusion models can "edit" while running inference. In other words, a "final" token isn't necessarily final and can still change, until at some later iteration the model decides it truly is. How does that work?

nvtop 2 days ago | parent [-]

Correct, diffusion LMs can edit their intermediate predictions, so "final" tokens aren't necessarily final. This is an exciting property because it allows the model to correct errors in what has been generated so far -- something that GPT-like models can't do.

This editing relies on the Transformer encoder's ability to predict token probabilities for __every__ position in a sequence, not just for the [MASK]s. So when you input the three-token sentence `[MASK] cat barks`, the Transformer produces a probability distribution over the vocabulary for each of the three positions, essentially for free.

Now you can come up with many ways to decide whether to edit a token or keep it as is. In the simplest case, take the new token if its probability is higher than the original's by some margin. In our example, say the model returns the probability of the token "cat" in the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use that in subsequent iterations.

Actual heuristics are slightly more complicated, but the base idea is this.
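
A toy version of that margin rule, again assuming a `model` that returns logits for every position (`editable` marks the positions we're allowed to revise; all names are illustrative):

    import torch

    @torch.no_grad()
    def maybe_edit(model, ids, editable, margin=0.2):
        # Re-check tokens that are already placed; swap one out if the model
        # now prefers a different token by at least `margin`.
        probs = model(ids).softmax(-1)                               # (1, length, vocab)
        current_p = probs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)  # probability of the token currently there
        best_p, best_tok = probs.max(-1)                             # best alternative per position

        should_edit = editable & (best_p > current_p + margin)
        return torch.where(should_edit, best_tok, ids)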

P.S. In order to teach the LM not to just copy the unmasked input tokens but to try to find better replacements, your training objective should include replacing some % of the input tokens with other random tokens. Now part of the input is masked and part of it is corrupted, so the model can't blindly assume that all input tokens are there to stay.
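
A sketch of that noising step for a training batch (in practice the mask fraction is usually sampled per example rather than fixed; the percentages here are made up):

    import torch

    def corrupt_for_training(ids, mask_id, vocab_size, mask_frac=0.5, corrupt_frac=0.1):
        # Return a noised copy of `ids`: some positions masked, some swapped for random tokens.
        noised = ids.clone()
        r = torch.rand(ids.shape)

        noised[r < mask_frac] = mask_id                                # masked positions
        corrupted = (r >= mask_frac) & (r < mask_frac + corrupt_frac)  # corrupted positions
        noised[corrupted] = torch.randint(vocab_size, (int(corrupted.sum()),))

        return noised  # training target: recover the original `ids` at every position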

paulsmith a day ago | parent | next [-]

> say the model returns the probability of the token "cat" in the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use that in subsequent iterations.

Might one speed/quality tradeoff be a tree "search" for better outcomes by branching on logit choices? If a diffusion model is so much faster overall than AR, I might not mind hunting or backtracking for the best overall probabilities.

skydhash 2 days ago | parent | prev [-]

But what about the dependency graph between symbols in the program? All those symbols have strong constraints around them, which is the program design.

The same issue comes up in image diffusion, when you ask for a portrait and some details come out wrong. That's because the face has constraints (which you learn about as an artist). Patterns and probability won't help you there.

angusturner 2 days ago | parent [-]

You assume that for small steps (i.e. taking some noisy code and slightly denoising it) you can make an independence assumption (all tokens conditionally independent, given the current state).

Once you chain many steps you get a very flexible distribution that can model all the interdependencies.

A stats person could probably provide more nuance, but an interesting connection I've seen: there is some sense in which diffusion generalises autoregression, because you don't have to pick an ordering when you factor the dependency graph.

(Or put otherwise, for some definitions of diffusion you can show autoregression to be a special case).

skydhash a day ago | parent [-]

There's a reason formal verification is the highest guarantee we have for software. To have complete assurance of what a program can and cannot do, the semantics of each of its components need to be known, recursively.

A minor change in one token can change the meaning of the whole program. Programming is just trying to enforce semantics on instructions (how well that is done is software engineering's realm).

An algorithm like merge sort is just a set of semantic constraints, which is why most books go with their own notation: the code itself does not really matter.

At most, LLMs and diffusion models can be regarded as fancy searches. But what you actually want is semantics, which is why you can design a lot of things on paper. We do it in the code editor because feedback is nice, and libraries' documentation (if it exists) lies about their semantics. And we read code because nothing is more complete about semantics than the code itself.

oliwary 2 days ago | parent | prev | next [-]

Fascinating, and great explanation.

What about insert and delete operations, though? Isn't there a risk of having too few tokens to properly finish the code in between the "final" tokens?

Workaccount2 2 days ago | parent | prev | next [-]

Can you have a hybrid model that does both autoregression and diffusion? It doesn't seem like there is anything that would fundamentally prevent this. A model with diffusion CoT for rapid "thought" generation, and then autoregression for the final answer on the output.

nvtop a day ago | parent [-]

You can absolutely do it, and I think it's a nice idea to try.

shawntan a day ago | parent | prev | next [-]

I'm curious how the speed is achieved if this is the technique used. Generally I'd expect this "masked language model" technique to be far slower, since the full vocab projection needs to be computed every iteration.

I always thought the eventual technique would be some form of diffusion in continuous space, then decoding into the discrete tokens.

Also I'm guessing this is a "best guess" of how Gemini Diffusion is done?

victorbjorklund 2 days ago | parent | prev | next [-]

Thanks. Best explanation of text diffusion.

ctxc 2 days ago | parent | prev | next [-]

Thank you for the explanation!

moralestapia 2 days ago | parent | prev [-]

Whoa man, thanks.

This is a great explanation.

yorwba 2 days ago | parent | prev | next [-]

You could downscale text the same way you downscale images, by averaging token embeddings instead of pixel values. But you don't have to. AFAIK vision transformers don't suffer from sparse gradients that need a resolution hierarchy to overcome; downscaling is just a performance optimization, because processing an image at full resolution is expensive.
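
For a concrete picture, here's a rough sketch of that kind of downscaling: mean-pooling token embeddings over a fixed window, analogous to averaging pixels (purely illustrative):

    import torch

    def downscale_embeddings(token_embeddings, factor=4):
        # Average each run of `factor` consecutive token embeddings, like averaging pixels.
        length, dim = token_embeddings.shape
        length = (length // factor) * factor   # drop the ragged tail for simplicity
        return token_embeddings[:length].reshape(-1, factor, dim).mean(dim=1)

    # e.g. 128 tokens with 512-dim embeddings -> 32 "coarse" positions
    coarse = downscale_embeddings(torch.randn(128, 512), factor=4)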

sroussey a day ago | parent [-]

So downscaling will summarize?

pertymcpert 2 days ago | parent | prev [-]

I have the exact same questions as you. I can barely understand how diffusion works for images; for sequential data like text, it makes no sense to me.

janalsncm 2 days ago | parent [-]

Let’s suppose we have 10k possible tokens in the vocabulary.

Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.

For each column, exactly 1 pixel is white (corresponding to the word which is there) and the rest are black.

Then the diffusion process is the same. Repeatedly denoising.

moralestapia 2 days ago | parent [-]

No, that intuition is incorrect.

Denoising models work because a lot of regions turn out to be smooth; you cannot do that "in a discrete way", if that makes sense.

janalsncm a day ago | parent | next [-]

Feel free to give a better explanation. I am not an expert. Clearly denoising models do work on text though.

moralestapia a day ago | parent [-]

This one's closer to the thing.

https://news.ycombinator.com/item?id=44059646

lostmsu a day ago | parent | prev [-]

They may be smooth in embedding space

bredren 2 days ago | parent | prev | next [-]

> however it’s been overshadowed by Veo 3 etc.

Because it’s simple to understand the power and difference in capability of Veo 3.

Understanding important steps forward in text completion requires understanding the value of what we already have, and the potential implications. Many people are not yet convinced LLMs are valuable for coding at all.

NitpickLawyer 2 days ago | parent | prev | next [-]

> Diffusion models for code generation are a big deal.

This is my intuition as well, as there is a lot of low-hanging fruit that a model like this could tackle in coding:

- you should be able to have a workflow where you constrain the generation with a function definition and its expected output, and "generate" the tokens in between (see the sketch after this list). Kind of like constrained generation, but with the model able to attend to tokens both ways.

- you should also be able to use a two-step workflow: first write a high-level description of the function layout (think "write the chapters for an article on x" from LLMs), then ping-pong between the actual implementations ("and now write chapter x"), using larger and larger context, and using proxies like linters, code compilation, and AST-derived info as signals of "completion". Lots of things to be tried here indeed.
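
A rough sketch of the first idea: hold the function definition and expected output fixed as prefix/suffix, and let the model demask only the span in between (the `model` interface and every name here are assumptions, not Gemini's actual API):

    import torch

    @torch.no_grad()
    def fill_between(model, prefix_ids, suffix_ids, mask_id, gap_len=24, steps=8):
        # Constrained infill: prefix and suffix stay fixed, only the gap gets demasked.
        gap = torch.full((gap_len,), mask_id, dtype=torch.long)
        ids = torch.cat([prefix_ids, gap, suffix_ids]).unsqueeze(0)  # (1, total_len)
        start, end = len(prefix_ids), len(prefix_ids) + gap_len      # the editable span

        for step in range(steps):
            pred = model(ids).argmax(-1)                             # the model attends both ways
            masked = (ids[0] == mask_id).nonzero(as_tuple=True)[0]
            k = max(1, masked.numel() // (steps - step))
            chosen = masked[torch.randperm(masked.numel())[:k]]
            ids[0, chosen] = pred[0, chosen]                         # finalize a few gap tokens

        return ids[0, start:end]                                     # the generated middle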

janalsncm 2 days ago | parent | next [-]

That's kind of hard though, right? If we have a rule that only B can follow A, and the token at position 5 changes to an A, you will have a cascade of constraints to satisfy.

bn-l 2 days ago | parent | prev [-]

Like in-painting except code?

impossiblefork 2 days ago | parent | prev | next [-]

I am not sure.

In principle one would imagine that models of this type would have an advantage: you can use information from both the left and the right, etc. In practice I've found LLaDA to be impressive considering its size and my assumption that it was trained with modest resources, but these models are behind in perplexity, and I think this is unavoidable. They also become rather fixed early, so I don't fully believe the hope that they can really correct text deeply. They will of course be able to correct their partially completed texts to some degree, especially when it's just a word or two that are wrong, but I believe the wrong words basically need to get masked simultaneously, so something like 1/masking_probability^2 for two wrong words, 1/masking_probability^3 for three, and so on.

Despite this I've been happy with the practical results I've seen during my experimentation.

spiderfarmer 2 days ago | parent | prev [-]

Not really, only because I saw it demoed before: https://www.inceptionlabs.ai

TeMPOraL 2 days ago | parent [-]

Right. It's not novel, but it's great to see this getting fully mainstream.