GistNoesis 2 days ago

Fast, you gotta go fast: let me draw the roadmap of this line of thinking.

- Let's start with the traditional autoregressive LLM, where one token is generated at a time. It's a fundamentally sequential process that maps well to the sequential nature of writing as you go (a minimal decode loop is sketched below).

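A minimal sketch of that sequential loop; `model` and `sample` are hypothetical stand-ins for a forward pass and a sampling rule, not any particular library's API:

    # Autoregressive decoding: each new token depends on all previous ones,
    # so the loop cannot be parallelized across output positions.
    def generate(model, sample, prompt_ids, max_new_tokens, eos_id):
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            logits = model(ids)           # one forward pass over the whole prefix
            next_id = sample(logits[-1])  # pick the next token at the last position
            ids.append(next_id)
            if next_id == eos_id:
                break
        return ids
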
- Then, to make generation go faster, you try to generate multiple tokens in one pass, parallelizing the sequential process with techniques like "lookahead decoding" (guess-and-verify sketch below).

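The common shape of these tricks is guess-and-verify: propose several future tokens cheaply, check them all with a single parallel forward pass, and keep the longest prefix the model agrees with. A simplified sketch, assuming a hypothetical greedy `model` that returns, for each position i, the token it would emit after seq[:i+1]:

    def guess_and_verify_step(model, ids, guesses):
        # One forward pass scores the prefix plus all guessed tokens at once.
        preds = model(ids + guesses)
        accepted = []
        for k, g in enumerate(guesses):
            # guesses[k] is valid only if the model, conditioned on everything
            # before it, would have produced the same token.
            if preds[len(ids) - 1 + k] != g:
                break
            accepted.append(g)
        # We always gain at least one token: the model's own next prediction.
        accepted.append(preds[len(ids) - 1 + len(accepted)])
        return ids + accepted
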
- (<-- We are here) Then you realize that if your model isn't writing as it goes, but rather forming an idea and pushing it out all at once, you can instead use a diffusion model to generate the whole response, allowing it a number of diffusion-step edits to make the errors that occurred during generation disappear. Conceptually, if the number of diffusion steps equals the length of the token sequence to generate, the diffusion process could generate tokens one at a time like an autoregressive LLM does. Usually 100 diffusion steps is a good starting point (toy sketch below).

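A toy version of that parallel-edit loop, in the masked-diffusion style (the `model` here is a hypothetical denoiser that predicts a token and a confidence for every position in one pass):

    MASK = -1

    def diffusion_generate(model, length, num_steps=100):
        ids = [MASK] * length                  # start from pure "noise"
        for step in range(num_steps):
            preds, confidence = model(ids)     # one parallel pass over all positions
            # Commit a growing fraction of positions, most confident first;
            # everything else stays open for correction in later steps.
            keep = length * (step + 1) // num_steps
            order = sorted(range(length), key=lambda i: -confidence[i])
            ids = [MASK] * length
            for i in order[:keep]:
                ids[i] = preds[i]
        return ids

Note that with num_steps == length this commits exactly one extra position per step, the degenerate one-token-at-a-time case mentioned above.
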
- Now the goal is to reduce the number of diffusion steps, to reduce computation cost. The diffusion literature is already well developed, and in the image/video domain it was shown that you can reduce the number of diffusion steps to one or two (albeit with some quality reduction), with techniques like "consistency models" (sketch below).

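The consistency trick, transplanted to this setting: instead of learning to undo one small noise step at a time, learn a function f(x, t) that jumps from any noise level t straight to the clean sample, so sampling collapses to one or two calls. A sketch under that assumption, where `f` is the trained consistency function:

    import random

    def consistency_sample(f, length, ts=(1.0, 0.25)):
        x = [random.gauss(0.0, ts[0]) for _ in range(length)]  # pure noise
        x = f(x, ts[0])                       # one-step generation
        for t in ts[1:]:                      # optional extra hops for quality:
            x = [xi + random.gauss(0.0, t) for xi in x]  # re-noise a little...
            x = f(x, t)                       # ...and jump back to clean
        return x
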
- Now that you only have a single diffusion step, you realize that you need to get your speed-ups elsewhere. You explore the literature and realize that you can apply a trick you have already applied once, one more time: compressing a few tokens into one, like you compressed multiple characters into one token. This allows you to reduce the length of the token sequence you need to generate by a factor of 4, at the price of an additional decoding step. This decoding step can be either some form of "latent" encoding or some form of "hierarchical" encoding. So now you are consistency-diffusing sentence vectors, which are then decoded into token sequences. With each step being smaller and the transformer being quadratic in sequence length, the total speed-up is roughly a factor of 10 (rough arithmetic below). But applying this trick multiple times gets you diminishing returns, which you can partially compensate for by increasing memory use (using a bigger "vocabulary" dictionary size).

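Rough cost arithmetic behind that factor-10 claim, taking attention as quadratic in sequence length and ignoring constants (the concrete numbers are illustrative, not from the source):

    seq_len = 1024
    compress = 4                               # tokens per sentence vector

    base_cost = seq_len ** 2                   # token-level attention
    latent_cost = (seq_len // compress) ** 2   # attention over sentence vectors
    decode_cost = seq_len * compress           # local decoder, roughly linear

    print(base_cost / (latent_cost + decode_cost))
    # ~15x on attention alone; closer to ~10x once the non-attention
    # (linear) parts of the model are counted too.
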
- To make it go faster you now have to dig into the internals of the transformer itself. You suddenly realize it is just a residual network applied "number of layers" times. Being a residual network, the goal of this sequence of internal steps is to refine the input into the output progressively. But that is exactly the property that let you go from "number of diffusion steps" to a single diffusion step. So you can compress your stack of layers into a single (bigger, to keep capacity) layer, and let the diffusion process correct the mistakes (sketch below).

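The analogy in code: a residual stack refines its input over many small steps, which is the same structure consistency distillation collapses in the time dimension, so the same move can be tried in the depth dimension. A sketch, where `wide_layer` is a hypothetical single layer trained to match the whole stack:

    def deep_residual(x, layers):
        # The usual transformer body: many small refinement steps.
        for f in layers:
            x = x + f(x)
        return x

    def distilled_body(x, wide_layer, corrections=1):
        # One big jump, plus a few cheap refinement passes (the diffusion
        # loop's job) to mop up the distillation error.
        x = x + wide_layer(x)
        for _ in range(corrections):
            x = x + wide_layer(x)
        return x
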
- Now that you have a single-layer transformer consistency model generating sentence vectors, you realize that transformers use multiple heads to explore the space more efficiently, but once training is done you can get by with a single head (reference sketch below), gaining another 10x reduction in computation along the way.

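For reference, here is everything a single attention head does; the claim is that after training you can distill the multi-head version into one suitably sized instance of this (numpy sketch, shapes made up for illustration):

    import numpy as np

    def single_head_attention(x, Wq, Wk, Wv):
        # x: (seq, d). One head = one QKV projection + one softmax mixing step.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v               # each position mixes the values it attends to

    x = np.random.randn(8, 16)
    Wq, Wk, Wv = (np.random.randn(16, 16) / 4 for _ in range(3))
    out = single_head_attention(x, Wq, Wk, Wv)   # shape (8, 16)
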
- Taking a step back, you realize that your transformer is now just doing a nearest-neighbor search and mixing the outputs. But it's doing it in a brute-force fashion. So you replace it with an approximate nearest-neighbor search like an HNSW vector database (sketch below), decoupling computation from capacity and allowing you to scale up by trading space for time.

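A sketch of that swap using hnswlib (a real HNSW library; treating the learned keys and values as a static database you can query is the assumption being made here):

    import numpy as np
    import hnswlib

    dim, n_keys = 128, 100_000
    keys = np.random.randn(n_keys, dim).astype(np.float32)    # stand-ins for learned keys
    values = np.random.randn(n_keys, dim).astype(np.float32)  # stand-ins for learned values

    index = hnswlib.Index(space='ip', dim=dim)   # inner product ~ attention score
    index.init_index(max_elements=n_keys, ef_construction=200, M=16)
    index.add_items(keys, np.arange(n_keys))

    def approx_attention(query, k=32):
        # Brute-force attention scores all n_keys; here we fetch only the k
        # nearest and mix their values, so compute stays ~O(log n) per query
        # while capacity scales with the size of the database.
        labels, dists = index.knn_query(query.astype(np.float32), k=k)
        w = np.exp(-dists[0]); w /= w.sum()      # rough softmax over neighbors
        return w @ values[labels[0]]
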
- But because Hierarchical Navigable Small Worlds are just graphs under the hood, you realize that you have reinvented the Good Old-Fashioned AI graph-database ontology, but in an emergent fashion: the graph is implicitly defined by vector distances in a semantic space constructed so that it is easy to generate text once decoded appropriately.

- So now you only need to make your database explainable by mapping it onto human-understandable labels, and you reach the grail: SQL.

djmips 2 days ago | parent

Is this a Shaggy Dog Story?

GistNoesis 2 days ago | parent

If only...

When you first encounter diffusion models, you usually see a well-formed picture emerge from noise.

And then you realize there is no reason it shouldn't work for anything you can add noise to. Which means everything: from pictures, to audio, to text, to anything encoded as data.

An infinite world of images and human creations in 10GB of weights.

A meaningful universe lost in specks of dust.

I remembered the line from Genesis: "For dust you are and to dust you shall return".

I suppose we all thought that, one way or another.

skydhash 2 days ago | parent

You forgot about constraints, especially cascading ones, where one detail can shape the whole thing (think shadow and light locations and directions).