alyxya 2 days ago
I’ve had a hard time parsing what exactly the paper is trying to explain. So far I’ve understood that their comparison is between models within the same family with the same weight tensor dimensions, so they aren’t showing a common subspace where there isn’t a 1:1 match between weight tensors, e.g. between a ViT and GPT2. The plots showing the distribution of principal component values presumably do this for every weight tensor, but that seems like an expected result: the principal component values follow a decaying curve, like a log curve, where only the first few principal components are really meaningful.

What I don’t get is what is meant by a universal shared subspace, because there is some invariance in the specific weight values and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication), and all that would do is swap two values in the resulting product; whatever consumes that output could undo the swap, so the whole model has identical behavior, yet you’ve changed the directions of the principal components. Because of that, fully independently trained models can’t share the exact subspace directions for analogous weight tensors.
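A minimal NumPy sketch of that permutation argument (the matrices here are illustrative stand-ins for two consecutive linear layers, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two consecutive linear maps: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 8))

# Permutation matrix that swaps rows 0 and 3
P = np.eye(8)
P[[0, 3]] = P[[3, 0]]

# "Permuted" model: permute W1's rows and let the next layer undo it (P.T == P^-1)
W1_p = P @ W1
W2_p = W2 @ P.T

x = rng.standard_normal(8)
print(np.allclose(W2 @ (W1 @ x), W2_p @ (W1_p @ x)))  # True: same end-to-end function

# But the principal directions of W1 itself have moved: the left singular
# vectors of P @ W1 are (up to sign) P times those of W1, so a per-tensor
# SVD/PCA sees different directions even though the two models behave identically.
U, S, Vt = np.linalg.svd(W1)
U_p, S_p, Vt_p = np.linalg.svd(W1_p)
print(np.allclose(S, S_p))  # True: the singular value spectrum is unchanged
print(np.allclose(U, U_p))  # False: the subspace directions are not
```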
seeknotfind 2 days ago (in reply)
Yeah, it sounds platonic the way it's written, but it seems more like a hyped model compression technique.