BloodAndCode a day ago
Did you try repeating the same mid-layer block more than once? If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation. It would be really interesting to see which of those regimes the model falls into. Something like this minimal sketch would test it (assuming a Hugging Face Llama-style model whose decoder stack lives in model.model.layers; the checkpoint name and the start/end/repeat values are just placeholders):
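
    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical checkpoint; any Llama-style model with a
    # model.model.layers ModuleList should behave the same way.
    name = "meta-llama/Llama-3.2-1B"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

    def repeat_block(base, start, end, n_repeats):
        """Copy the model and run layers[start:end] n_repeats times in sequence."""
        m = copy.deepcopy(base)
        layers = list(m.model.layers)
        layers[start:end] = layers[start:end] * n_repeats  # shared weights, extra passes
        m.model.layers = torch.nn.ModuleList(layers)
        m.config.num_hidden_layers = len(layers)
        return m

    # Sweep repeat counts; evaluate with use_cache=False, since the repeated
    # layers share a layer_idx and would clobber each other's KV cache.
    prompt = tok("The capital of France is", return_tensors="pt")
    for k in (1, 2, 3, 4):
        m = repeat_block(model, start=8, end=12, n_repeats=k)
        with torch.no_grad():
            out = m(**prompt, use_cache=False)
        print(k, out.logits[0, -1].argmax().item())

Swapping the next-token print for a perplexity eval on a held-out set would make the diminishing-returns curve (or the cliff) directly visible.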
dnhkng a day ago | parent
Yes! I tried that pretty early on, and it's basically never good. It's described in this section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...