hiimshort 2 days ago

I have been wondering about the use of diffusion techniques for text generation, so it is nice to see Google release a model that, seemingly, validates some thoughts I had.

Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700XT, and I am not able to upgrade from that yet. That limitation, though, has at least given me some significant insights into the shortcomings of current models.

Model sizes have gotten quite large, and coherence seems mostly to have scaled with the density of a model, leaving smaller models useful only for smaller tasks. Context size is also extremely important, judging from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even when limited by context. From my viewpoint it seems they will. Mixed tool-call + response outputs also have the potential to be better.
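To make the "model plus context won't fit" problem concrete, here is the rough arithmetic I use. All the numbers are made-up but plausible values for a generic 7B-class transformer, not any specific model:

    # Back-of-the-envelope VRAM check: can a model plus its KV cache
    # fit on an 8 GB card like the 5700XT? Illustrative numbers only.
    GIB = 1024**3

    def weights_bytes(n_params: float, bits_per_weight: float) -> float:
        """Memory for the model weights at a given quantization level."""
        return n_params * bits_per_weight / 8

    def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                       context_tokens: int, bytes_per_elem: int = 2) -> float:
        """KV cache: 2 tensors (K and V) per layer, per token."""
        return (2 * n_layers * n_kv_heads * head_dim
                * context_tokens * bytes_per_elem)

    # Assumed 7B-style configuration (hypothetical values).
    weights = weights_bytes(7e9, bits_per_weight=4)  # 4-bit quantized
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=32,
                           head_dim=128, context_tokens=8192)

    print(f"weights:  {weights / GIB:.1f} GiB")  # ~3.3 GiB
    print(f"KV cache: {cache / GIB:.1f} GiB")    # ~4.0 GiB
    print(f"total:    {(weights + cache) / GIB:.1f} GiB vs 8 GiB VRAM")

Even at 4-bit quantization, an 8K context eats roughly as much VRAM as the weights themselves, before counting activations and overhead.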

Speed is another problem that I, and everyone else, have had with modern LLMs. The autoregressive nature of re-feeding the input plus each newly generated token back through the model is time-consuming, and on an older GPU with no AI-specific hardware it takes an eternity. Being able to at least track a 0-100% progress state would be an improvement over the current situation, where one must simply wait for the LLM to decide to stop (or hit the maximum number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform at least slightly better.
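As I understand it, the progress-tracking difference comes straight out of the control flow. The model calls below are hypothetical stubs, but the loop structure is the point:

    # Why a diffusion sampler gives you an honest progress bar and an
    # autoregressive one does not. `model` is a hypothetical stub.

    def generate_autoregressive(model, prompt_ids, max_tokens, eos_id):
        # Each step reprocesses the growing sequence and appends one
        # token. The total step count is unknown in advance, so
        # "progress" is only bounded by max_tokens, not completion.
        ids = list(prompt_ids)
        for step in range(max_tokens):
            next_id = model.next_token(ids)  # one full forward pass
            ids.append(next_id)
            if next_id == eos_id:            # stops whenever it decides
                break
        return ids

    def generate_diffusion(model, prompt_ids, block_len, n_steps=32):
        # The whole block is refined over a fixed number of denoising
        # steps, so step/n_steps is a real 0-100% progress indicator.
        block = model.init_noisy_block(block_len)
        for step in range(n_steps):
            block = model.denoise(block, prompt_ids, step, n_steps)
            print(f"progress: {100 * (step + 1) / n_steps:.0f}%")
        return block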

This does raise several questions. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
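My current best guess, from the text-diffusion papers I have skimmed, is that the "noise" for text is usually not Gaussian at all: discrete diffusion corrupts a sequence by masking tokens, and sampling starts from an all-mask block of a chosen length, which would also explain fixed-size blocks, with padding/EOS tokens absorbing any unused length. A toy sketch of that forward corruption step, with a made-up MASK_ID:

    import random

    # One common answer to "where does the noise come from" for text:
    # masked (discrete) diffusion replaces tokens with a [MASK] symbol
    # rather than adding Gaussian noise. Sketch of the forward process.
    MASK_ID = 0  # assumed mask-token id

    def corrupt(tokens: list[int], t: float) -> list[int]:
        """Mask each token independently with probability t in [0, 1].

        t = 0 leaves the text intact; t = 1 is pure "noise" (all
        masks). The reverse model is trained to fill the masks back in.
        """
        return [MASK_ID if random.random() < t else tok
                for tok in tokens]

    print(corrupt([17, 42, 99, 7], t=0.5))  # ~half the tokens masked

I would still love to hear whether variable-length output is practical on top of that, or whether fixed blocks are baked in.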