dnhkng 7 hours ago:
Author here: That was done in this blog post, in the beam search. I started with the best re-layer configs and iteratively added more blocks, including the same block multiple times, during a long beam search. It turns out this does not help (somewhat surprisingly).
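The search described above can be sketched as a beam search over sequences of layer-block indices, where the same block may appear more than once. This is a minimal illustrative sketch, not the author's actual code: `score` stands in for whatever benchmark evaluates a candidate configuration, and the toy scorer below is invented purely so the example runs.

```python
# Hypothetical sketch of a beam search over layer configs: candidates
# are tuples of block indices (repeats allowed), scored by a
# user-supplied function. All names here are illustrative.

def beam_search_layers(n_blocks, score, depth=4, beam_width=3):
    """Extend configs one block at a time, keeping the top
    `beam_width` configs at each depth. `score` maps a config
    (tuple of block indices) to a number (higher is better)."""
    beam = [()]  # start from the empty config
    for _ in range(depth):
        candidates = [cfg + (b,) for cfg in beam for b in range(n_blocks)]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

# Toy stand-in for a real benchmark: rewards ascending blocks and
# mildly penalizes repeated blocks.
def toy_score(cfg):
    ascending = sum(1 for a, b in zip(cfg, cfg[1:]) if b == a + 1)
    repeats = len(cfg) - len(set(cfg))
    return ascending - 0.5 * repeats

best = beam_search_layers(n_blocks=6, score=toy_score)
```

Under this toy scorer the search settles on an ascending, repeat-free config; with a real benchmark as `score`, repeats are free to win if they actually help.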
coppsilgold 2 hours ago:
It's possible that the gains come despite the noise the coarse process introduces, and that after two repetitions the noise overwhelms the advantage. The residual connections resemble the Euler method (this observation led to Neural ODEs, IIRC), which isn't exactly known for accuracy. If the model has been trained with a particular number of layers, adding more layers will also add a lot of noise. Ultimately, the LLM will need to be fine-tuned with the loops, or a looped architecture trained from scratch, such as <https://ouro-llm.github.io>; unfortunately they made the mistake of looping the entire LLM rather than just the center portion.
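The Euler analogy above can be made concrete: a residual block computes x + f(x), which is one Euler step of dx/dt = f(x) with step size 1, and plain Euler is a crude integrator. This sketch (my own toy dynamics, not anything from the model) shows the discretization error of step-size-1 Euler against the exact solution, which is the sense in which extra "steps" the network wasn't trained for can inject noise.

```python
# A residual update x <- x + f(x) is an Euler step of dx/dt = f(x)
# with h = 1. Toy linear dynamics, exact solution x0 * exp(-0.5 t).

import math

def f(x):
    return -0.5 * x  # illustrative dynamics, not a real layer

def euler_steps(x0, n_steps, h=1.0):
    x = x0
    for _ in range(n_steps):
        x = x + h * f(x)  # the residual-style update when h = 1
    return x

x0 = 1.0
coarse = euler_steps(x0, 4)        # 4 "layers" at step size 1 -> 0.5**4
exact = x0 * math.exp(-0.5 * 4)    # true state at t = 4
error = abs(coarse - exact)        # sizeable: Euler at h = 1 is crude
```

The gap between `coarse` and `exact` here is the kind of discretization error the comment is gesturing at; a network trained end-to-end with a fixed depth has effectively learned around its own step size.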
skyde 6 hours ago:
Actually not surprised. I guess this works for the same reason "say it twice" [1] works: because LLMs are trained as causal language models, past tokens cannot attend to future tokens. One extra copy of the layer set solves this. [1] https://arxiv.org/html/2512.14982v1
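The causal-mask point can be shown directly. In a causal LLM, position i may only attend to positions <= i; if the prompt is repeated ("say it twice"), every token in the second copy can attend to the entire first copy. A minimal numpy sketch, with illustrative sizes:

```python
# Causal attention mask for a prompt of n tokens repeated twice.
# mask[i, j] is True when position i may attend to position j.

import numpy as np

n = 4                      # tokens in the original prompt
L = 2 * n                  # prompt said twice
mask = np.tril(np.ones((L, L), dtype=bool))

early_token = 1            # in the first copy: sees only positions 0..1
sees_first_pass = int(mask[early_token].sum())

same_token_second_copy = n + 1   # the same token in the second copy
sees_whole_prompt = bool(mask[same_token_second_copy, :n].all())
```

Note this helps at the *sequence* level (repeating the input tokens); whether re-running layers on the same positions gives a comparable effect is the commenter's conjecture, since the causal mask applies identically in every layer.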