Remix clone Hacker News

	vlovich123 2 hours ago
		> Ever read a DeepSeek paper? Ever hear of MLA? Mamba? Or gated deltanet? Or RLMs? Universal transformers?

		Quite a few of those aren’t transformer architectures. MLA is more of a KV-cache optimization that doesn’t degrade intelligence than something that directly improves it. Indirectly it lets you run a larger model on the same hardware, but that’s it. It’s also 2 years old, while universal transformers are 8 years old, and only MLA has seen adoption. Your reply was full of Gish gallop nonsense that argues against there being anything really new in transformer capabilities as far as intelligence goes.

		> Not arguing that. I just don't accept that the path to generality goes through giving up "transformers", whatever this term means after the architectural Cambrian explosion of the past few years.

		I mean, dLLMs are architecturally quite different from the plain LLM transformers you get from OpenAI or Anthropic today, even if they use transformers if you squint at them: they think bidirectionally and are embarrassingly parallel. Why would the next explosion not be architecturally different from the previous one? Indeed, you’d expect a difference, because anything that can overcome today’s transformers has to be exponentially better, and anything based around transformers won’t be. There are clearly still a few orders of magnitude between humans and LLMs.

		> Sure, these things aren't pure transformers. But neither are frontier models. The industry is already doing what you suggest and moving beyond naive KQ dot product full depth everywhere 2010s-era transformers.

		But they’re not, not really. The architectural difference between Llama 3.2 and Claude Fabel is relatively small, with most of the gains coming from RL, training data, size, training systems, and inference loop infrastructure.

		It has all clearly made a huge difference, but there haven’t been huge structural changes; most of the structural changes are around inference efficiency, trying to optimize performance without sacrificing intelligence. At some point you’ll run out of headroom on how far you can take that, and that point will be a long way from AGI.