jmward01 7 days ago

I have an internal repo that does guided window attention. I figured out One Weird Trick to get the model to learn how to focus, so you can move a fixed window around instead of using full attention. I also built NNMemory (though that appears to be an idea others have had now too [1]), and I have a completely bonkers mechanism for non-deterministic exit logic so the model can spin until it thinks it has a good answer. I also built scale-free connections between layers to completely remove residual connections, plus some crazy things with sacrificial training (adding parameters that are removed after training in order to boost training performance with no prod penalty). There are more crazy things I have built, but they aren't out there in the wild yet. Some of the things I have built are in my repo. [2] I personally think we can get 0.5B models to outperform today's 8B+ SOTA models (even the reasoning models coming out now).
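To give a feel for the sacrificial training idea, here's a toy sketch (not the actual lmplay code; names like SacrificialLinear and fold_ are made up for illustration): a layer carries extra parameters that only exist during training, then folds them into the base weights so inference pays nothing.

    import torch
    import torch.nn as nn

    class SacrificialLinear(nn.Module):
        # Toy example: extra "sacrificial" parameters exist only during
        # training and get folded into the base weights before deployment,
        # so there is no inference-time cost.
        def __init__(self, d_in, d_out):
            super().__init__()
            self.linear = nn.Linear(d_in, d_out)
            # Sacrificial extras: a per-output scale and bias that give the
            # optimizer more room to move during training.
            self.sac_scale = nn.Parameter(torch.ones(d_out))
            self.sac_bias = nn.Parameter(torch.zeros(d_out))
            self.folded = False

        def forward(self, x):
            y = self.linear(x)
            if self.folded:
                return y
            return y * self.sac_scale + self.sac_bias

        @torch.no_grad()
        def fold_(self):
            # (W x + b) * s + sb == (s * W) x + (s * b + sb), so the extras
            # can be merged into the base linear and then deleted outright.
            self.linear.weight.mul_(self.sac_scale.unsqueeze(1))
            self.linear.bias.mul_(self.sac_scale).add_(self.sac_bias)
            del self.sac_scale, self.sac_bias
            self.folded = True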

The basic transformer block has been good at kicking things off, but it is now holding us back. We need to move back to recurrent architectures and switch to fixed guided attention windows plus 'think'-only layers like NNMemory. Attention is distracting, and we know this as humans because we often close our eyes when we think hard about a problem on the page in front of us.
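For flavor, a 'think'-only layer of the kind I mean looks roughly like this (just a generic sketch, not my NNMemory implementation; n_mem and the class name are placeholders): tokens query a bank of learned memory vectors instead of each other, so the layer's cost doesn't grow with context length.

    import torch
    import torch.nn as nn

    class ThinkOnlyMemoryLayer(nn.Module):
        # Generic sketch of a "think-only" layer: tokens cross-attend to a
        # bank of learned memory vectors rather than to other tokens, so the
        # cost scales with n_mem, not with context length.
        def __init__(self, d_model, n_mem=64, n_heads=8):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(n_mem, d_model) * d_model ** -0.5)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
            out, _ = self.attn(self.norm(x), mem, mem, need_weights=False)
            return x + out  # plain residual here just to keep the sketch standard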

[1] https://arxiv.org/abs/2502.06049

[2] https://github.com/jmward01/lmplay