4bpp 5 hours ago

Assuming the benchmarks are sound (rather than capturing a fluke), the provided explanation still does not pass the smell test. As far as I can tell, nothing about the training process of these models would encourage them to make the output of any layer other than layer n-1 meaningful as the input of layer n, unless perhaps those layers were initialised as the identity and the training process did not get to change them much. (Plausible for middle layers?)

Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; (3) the mechanism causing the degradation involves the duplicated layers, so their duplication has the effect of breaking the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).

More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.

kirill5pol 3 hours ago | parent [-]

Basically all of them are using residual connections, so every layer reads from and writes a small update into the same residual stream; a duplicated layer still sees an input from roughly the distribution it was trained on. It's not that surprising honestly
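
To make the residual point concrete, here is a minimal numpy sketch; the hidden size, weight scale, and tanh nonlinearity are all invented for illustration and stand in for a real transformer block. With a residual connection, a layer computes y = x + f(x), so applying the same layer twice only adds a second small nudge to the stream instead of handing the layer an input from a foreign representation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (made up for illustration)

# Small, made-up weights: f(x) = tanh(x @ W) is a modest update,
# so x + f(x) stays close to the identity on most inputs.
W = 0.1 * rng.standard_normal((d, d))

def residual_layer(x):
    # Residual connection: the layer writes into the stream it reads from.
    return x + np.tanh(x @ W)

x = rng.standard_normal(d)

once = residual_layer(x)
twice = residual_layer(residual_layer(x))  # "duplicated" layer

# Duplication perturbs the output only modestly, because each pass
# contributes a bounded update on top of the unchanged residual path.
rel_change = np.linalg.norm(twice - once) / np.linalg.norm(once)
print(f"relative change from duplicating the layer: {rel_change:.3f}")
```

Without the `x +` term, stacking the layer would compose the full transform, and a duplicated layer would receive activations unlike anything it saw in training.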