Lerc a day ago

I have had broadly the same intuitions on the use of middle layers, but haven't had much luck with the tiny models that I can run on my hardware.

There's a video on YouTube about looping-layer models: https://www.youtube.com/watch?v=pDsTcrRVNc0

After watching it, I poured some thoughts off the top of my head into a comment which, of course, promptly sank without a trace. I'll repost the gist of them here.

If you gain a benefit from looping layers, then at some level every layer of parameters sits both in front of and behind every other layer, and the conclusion must be that the order of the layers does not need to be fixed at all.

If you cycle through the layers multiple times, are you doing so for the benefit of a particular layer on a particular problem? If so, can you skip the other layers that add nothing on repetition? Suppose, then, that you can skip (and know when to skip), and you can repeat (and know when to repeat).

What you would need is a mechanism that can decide which layer is needed next. Is that not then a looping single-layer MoE model, storing the layers as a wide set of selectable options rather than a deep stack of unconditional layers? You would pick what the next layer should be (or whether to exit the loop), with the exit threshold dropping each iteration so the loop always eventually exits, and a tunable 'how hard to think' knob to adjust that threshold.
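
A minimal sketch of that loop, with every name and number hypothetical: toy "layers" stand in for transformer blocks, a stand-in router scores each layer plus an exit option, and the exit threshold (the 'how hard to think' knob) drops each step so the loop always terminates.

```python
# Hypothetical sketch: layers stored as a selectable pool rather than a
# fixed stack. A stand-in router scores each layer plus "exit"; the exit
# threshold falls every iteration, so the loop always terminates.

def layer_double(h):            # toy "layer": doubles the state
    return [x * 2.0 for x in h]

def layer_inc(h):               # toy "layer": shifts the state
    return [x + 1.0 for x in h]

LAYER_POOL = [layer_double, layer_inc]

def router_scores(h):
    """Stand-in for a learned router: per-layer scores plus an exit score."""
    s = sum(h)
    return [s % 1.0, (s * 0.37) % 1.0], 0.5

def adaptive_forward(h, exit_start=2.0, exit_decay=0.4, max_steps=20):
    trace = []                                  # which layer ran at each step
    for step in range(max_steps):
        layer_scores, exit_score = router_scores(h)
        threshold = exit_start - exit_decay * step  # 'how hard to think' knob
        if exit_score > threshold:              # threshold falls each step,
            break                               # so this always fires eventually
        pick = max(range(len(LAYER_POOL)), key=lambda i: layer_scores[i])
        trace.append(pick)
        h = LAYER_POOL[pick](h)
    return h, trace
```

The interesting training question, which the sketch dodges, is how the router's pick would be made differentiable (e.g. a softmax over layer scores rather than a hard argmax).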

janalsncm 21 hours ago | parent | next [-]

That is an interesting idea. I suspect that if we relax the constraint that the layers in a loop mostly run in order, there is a combinatorial explosion issue.

But we could still try it out: randomize the order in which we call the transformer blocks and see if it affects performance. If it doesn't, that's extremely interesting.

Lerc 16 hours ago | parent [-]

You can still consider it logically from the point of view of in-order execution with optional looping and optional skipping. It stops being so combinatorially explodey then, but if you can always append an additional loop, and decide to skip based on the worthiness of each layer with varying thresholds, then it could theoretically learn an arbitrary ordering where you skip all-bar-one layer per loop.
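
The all-bar-one construction can be made concrete (names hypothetical): each outer loop sweeps the layers in fixed order, a gate skips every layer except the one wanted that pass, and the recorded execution sequence comes out in an arbitrary, out-of-order arrangement.

```python
# In-order sweeps plus skip gates recovering an arbitrary layer ordering.

def make_layer(tag):
    def layer(h):
        return h + [tag]           # record which layer actually ran
    return layer

LAYERS = [make_layer(i) for i in range(3)]
TARGET = [2, 0, 1]                 # desired out-of-order execution sequence

def gated_forward(h):
    applied = []
    for wanted in TARGET:          # one appended loop per desired layer
        for i, layer in enumerate(LAYERS):    # fixed in-order sweep
            if i == wanted:        # gate: skip all-bar-one layer this loop
                h = layer(h)
                applied.append(i)
    return h, applied
```

So arbitrary orderings sit inside the in-order-with-skips formulation; the question is whether a learned gate would ever find it worth paying n sweeps for one effective layer.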

There's probably a number of common sequences of layers that are inevitable when working on a problem, though. I think of it like an expression calculator that works through an expression tree, merging leaf nodes on each iteration. I wouldn't expect it to be quite so explicit with neural nets, but I feel like the underlying principle of doing the sub-parts, then doing the same thing on the results of the sub-parts, must be beneficial to some degree.
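
The calculator analogy, made literal as a sketch: one pass merges only the ops whose children are already numbers, so each iteration collapses one level of the tree, the same "do the sub-parts, then the same thing on their results" shape a looping layer would repeat.

```python
import operator

# Expression tree as nested tuples: (op, left, right), with numbers as leaves.
OPS = {'+': operator.add, '*': operator.mul}

def merge_leaves_once(node):
    """One pass: merge only ops whose children are already numbers."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)
    return (op, merge_leaves_once(left), merge_leaves_once(right))

def evaluate(tree, max_iters=32):
    """Repeat the same merge step until the tree collapses to a number."""
    steps = 0
    while not isinstance(tree, (int, float)) and steps < max_iters:
        tree = merge_leaves_once(tree)
        steps += 1
    return tree, steps
```

The number of iterations is just the tree depth, which is the loose analogue of how many loop passes a problem would demand.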

I think there's probably quite a lot to be revealed by studying the representations in those middle layers. If there's a 'how-much-have-we-solved-so-far' signal detectable in the data between layers, that would open up quite a lot of options, I think.
