phire 10 hours ago
That's really interesting. Makes me immediately ask two questions:

1. Should we be training models like this from the start? It seems that a model trained with layer loops would be able to take advantage of it better than rearranging the layers of a naive model.

2. Should we even be using a fixed number of layers? If models are this tolerant to their inner layers being meddled with, then it doesn't make sense to run all the layers on every single token. Maybe we could make a model that changed the number of iterations through the compute layers based on how much computation it thought the problem needed. Send it through only once for easy problems (perhaps even zero times?) and two or more times for harder problems. This would allow easier prompts to complete faster, while letting the model scale up to arbitrarily hard problems.

If we are training or fine-tuning the model, we can probably make the compute layers generate a confidence signal that predicts how likely it is that an extra compute iteration would meaningfully change the result.
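The variable-depth idea in point 2 can be sketched as a loop over a shared compute block, with a small halting head standing in for the confidence signal. This is just a toy NumPy illustration under my own assumptions (the weights are random, the names `adaptive_depth_forward`, `w_halt`, etc. are made up), not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size

# Shared "compute block" weights, applied repeatedly -- a toy stand-in
# for a transformer's looped inner layers.
W = rng.standard_normal((D, D)) / np.sqrt(D)
# Halting head: scores how confident the model is in the current state,
# i.e. how unlikely another iteration is to change the result.
w_halt = rng.standard_normal(D) / np.sqrt(D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_depth_forward(h, max_iters=8, threshold=0.9):
    """Loop the shared block until the halting signal clears the threshold."""
    steps = 0
    for _ in range(max_iters):
        h = np.tanh(h @ W)              # one pass through the compute layers
        steps += 1
        confidence = sigmoid(h @ w_halt)
        if confidence > threshold:      # easy inputs exit early
            break
    return h, steps

h0 = rng.standard_normal(D)
out, iters_used = adaptive_depth_forward(h0)
print(iters_used)  # how many loop iterations this input actually spent
```

In a real model the halting head would be trained (e.g. with an extra loss term that penalizes wasted iterations), so the threshold trades compute for accuracy per token rather than being fixed by hand.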