BloodAndCode a day ago
Did you try repeating the same mid-layer block more than once? If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation. It would be really interesting to see which of those regimes the model falls into. Something like this minimal sketch would test it (assuming a Hugging Face Llama-style model whose decoder stack lives in model.model.layers; the checkpoint name and the start/end/repeat values are just placeholders):
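
    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical checkpoint; any Llama-style model with a
    # model.model.layers ModuleList should behave the same way.
    name = "meta-llama/Llama-3.2-1B"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

    def repeat_block(base, start, end, n_repeats):
        """Copy the model and run layers[start:end] n_repeats times in sequence."""
        m = copy.deepcopy(base)
        layers = list(m.model.layers)
        layers[start:end] = layers[start:end] * n_repeats  # shared weights, extra passes
        m.model.layers = torch.nn.ModuleList(layers)
        m.config.num_hidden_layers = len(layers)
        return m

    # Sweep repeat counts; evaluate with use_cache=False, since the repeated
    # layers share a layer_idx and would clobber each other's KV cache.
    prompt = tok("The capital of France is", return_tensors="pt")
    for k in (1, 2, 3, 4):
        m = repeat_block(model, start=8, end=12, n_repeats=k)
        with torch.no_grad():
            out = m(**prompt, use_cache=False)
        print(k, out.logits[0, -1].argmax().item())

Swapping the next-token print for a perplexity eval on a held-out set would make the diminishing-returns curve (or the cliff) directly visible.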
dnhkng a day ago | parent
Yes! I tried that pretty early on, and it's basically never good. It's described in this section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...