impossiblefork 2 days ago

I am not sure.

In principle one would imagine that models of this type would have an advantage: you can use information from both the left and the right, etc. In practice I've found LLaDA to be impressive considering its size and my assumption that they have had small training resources, but they are behind in perplexity, and I think this is unavoidable. They also become rather fixed early, so I don't fully believe in these hopes of being able to really correct text deeply. They will of course be able to correct their partially completed texts to some degree, especially when it's just a word or two that are wrong. But I believe that the words that are wrong basically need to get masked simultaneously, so you'd need something like 1/masking_probability^2 tries for two wrong words, 1/masking_probability^3 for three, and so on.
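
To make that scaling concrete, here is a rough sketch of the arithmetic. It assumes each token is remasked independently with the same probability at every step, which is a simplification of my own and not how LLaDA's actual remasking schedule necessarily works. The function names are made up for illustration.

```python
import random


def expected_steps(masking_probability: float, num_wrong: int) -> float:
    """Mean number of steps until all `num_wrong` tokens are masked in the
    same step, assuming independent masking per token per step.

    The chance of a joint mask in one step is masking_probability ** num_wrong,
    and the mean of the resulting geometric distribution is its reciprocal.
    """
    return 1.0 / masking_probability ** num_wrong


def simulate_steps(masking_probability: float, num_wrong: int,
                   trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        steps = 1
        # Keep drawing steps until every wrong token is masked at once.
        while not all(rng.random() < masking_probability
                      for _ in range(num_wrong)):
            steps += 1
        total += steps
    return total / trials


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, expected_steps(0.5, k), round(simulate_steps(0.5, k), 2))
```

With a masking probability of 0.5 the closed form gives 2, 4 and 8 steps for one, two and three wrong words, and the cost grows much faster when the masking probability is small.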

Despite this I've been happy with the practical results I've seen during my experimentation.