coppsilgold 3 hours ago
It's possible that the gains come despite the noise the coarse process introduces, and that after two repetitions the noise overwhelms the advantage. Residual connections resemble the Euler method (IIRC, this observation led to Neural ODEs), which isn't known for being especially clean. If the model was trained with a particular number of layers, adding more layers will also add a lot of noise. Ultimately, the LLM will need to be fine-tuned with the loops, or a looped architecture trained from scratch, such as <https://ouro-llm.github.io>. Unfortunately, they made the mistake of looping the entire LLM rather than just the center portion.
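To make the Euler-method analogy concrete, here is a minimal sketch (assumed names and sizes, not any real LLM's code): a residual block computes x + f(x), which is exactly one Euler step of dx/dt = f(x) with step size 1, so looping a weight-shared block more times amounts to integrating the same ODE for more coarse steps.

```python
import numpy as np

# A residual block computes x + f(x); Euler's method for dx/dt = f(x)
# with step size h computes x + h*f(x). With h = 1, a stack of residual
# blocks is a coarse Euler discretization of an ODE -- the observation
# behind Neural ODEs. Sizes and loop counts here are illustrative only.

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))  # shared weights of the "looped" block

def f(x):
    """The residual branch (a stand-in for a transformer block)."""
    return np.tanh(x @ W)

def run(x, loops):
    """Apply the same residual block `loops` times (weight-shared looping)."""
    for _ in range(loops):
        x = x + f(x)  # one Euler step per loop iteration
    return x

x0 = rng.normal(size=4)
out4 = run(x0, 4)  # loop the block 4 times
out8 = run(x0, 8)  # more loops = a longer integration of the same dynamics
```

The point of the sketch is that extra loop iterations don't refine the step size; they extend the trajectory with the same coarse step, which is one way the accumulated discretization noise could eventually swamp any gains.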