nothrowaways · 2 days ago:
> Principal component analysis of 200 GPT2, 500 Vision Transformers, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.

Isn't it obvious?

stingraycharles · 2 days ago:
Well, intuitively it makes sense that within each individual model a small number of weights/parameters are very dominant, but it's still super interesting that these can be swapped between all the models without loss of performance. It isn't obvious that these parameters are universal across all models.
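
One way to picture the "swap" claim (toy sketch only, with made-up synthetic checkpoints rather than real GPT2/ViT weights): fit a low-dimensional basis on all models but one, then see how much of the held-out model's weights that basis already captures.

    import numpy as np

    rng = np.random.default_rng(1)
    n_models, d, k = 50, 2048, 8
    basis = rng.standard_normal((d, k))
    W = rng.standard_normal((n_models, k)) @ basis.T \
        + 0.05 * rng.standard_normal((n_models, d))   # fake checkpoints

    train, held_out = W[:-1], W[-1]
    mean = train.mean(axis=0)
    # Top-k right singular vectors of the training models span the shared subspace.
    _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
    U = Vt[:k].T                                # (d, k) orthonormal basis

    w = held_out - mean
    w_hat = U @ (U.T @ w)                       # projection onto that subspace
    print("fraction of held-out weight norm captured:",
          round(float(np.linalg.norm(w_hat) / np.linalg.norm(w)), 3))

If the subspace really is shared, the held-out model loses little when projected into it, which is the precondition for swapping components without tanking performance.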

levocardia · 2 days ago:
| This general idea shows up all over the place though. If you do 3D scans on thousands of mammal skulls, you'll find that a few PCs account for the vast majority of the variance. If you do frequency domain analysis of various physiological signals...same thing. Ditto for many, many other natural phenomena in the world. Interesting (maybe not surprising?) to see it in artificial phenomena as well |

vintermann · 2 days ago:
It's almost an artifact of PCA. You'll find "important" principal components everywhere you look. It takes real effort to construct a dataset where you don't. That doesn't mean, though, that throwing away the less important principal components of an image is the best way to compress an image, for instance.
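
For context, keeping only the top principal components of an image is just a truncated SVD / low-rank approximation; a rough, self-contained sketch (synthetic image, illustrative sizes):

    import numpy as np

    rng = np.random.default_rng(2)
    h, w = 256, 256
    yy, xx = np.mgrid[0:h, 0:w]
    img = np.sin(xx / 17.0) + np.cos(yy / 23.0) + 0.1 * rng.standard_normal((h, w))

    # PCA over the rows of the image == truncated SVD of the centered matrix.
    mean_row = img.mean(axis=0)
    U, s, Vt = np.linalg.svd(img - mean_row, full_matrices=False)

    for k in (4, 16, 64):
        approx = (U[:, :k] * s[:k]) @ Vt[:k] + mean_row
        err = np.linalg.norm(img - approx) / np.linalg.norm(img)
        stored = k * (h + w + 1)                # rough count of numbers kept
        print(f"k={k:3d}  relative error={err:.3f}  values stored={stored}")

A few components do capture most of the variance here, which matches the comment's point: finding dominant components is easy, but that alone does not make truncating them the best compressor for the job.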

mlpro · 2 days ago:
Not really. If the models are trained on different datasets - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized, to be completely different or even orthogonal.
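
The orthogonality intuition is easy to check numerically: independent random vectors in high dimensions are nearly orthogonal, so absent shared structure there is no reason for two unrelated models' weights to line up. A quick, self-contained check (dimension chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(3)
    d = 10_000
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"cosine similarity of two random {d}-dim vectors: {cos:+.4f}")
    # Typical magnitude is about 1/sqrt(d), i.e. roughly 0.01 here.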

energy123 · 2 days ago:
Every vision task needs edge/contrast/color detectors, and these should be mostly the same across ViTs, needing only a rotation and scaling in the subspace. Likewise with language tasks and encoding the basic rules of language, which are the same regardless of application. So it is no surprise to see intra-modality shared variation. The surprising thing is inter-modality shared variation. I wouldn't have bet against it, but I also wouldn't have guessed it.

I would like to see model interpretability work on whether these subspace vectors can be interpreted as low-level or high-level abstractions. Are they picking up low-level "edge detectors" that are somehow invariant to modality (if so, why?), or are they picking up higher-level concepts like distance vs. closeness?
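
The "needing only a rotation and scaling" part can be made concrete with orthogonal Procrustes alignment: given two sets of feature directions that differ by an unknown orthogonal transform, the best-aligning rotation falls out of one SVD. A toy sketch with synthetic "feature directions" (not real ViT weights):

    import numpy as np

    rng = np.random.default_rng(4)
    d, k = 512, 16
    A = rng.standard_normal((d, k))                  # "model 1" directions
    Q, _ = np.linalg.qr(rng.standard_normal((k, k))) # hidden orthogonal transform
    B = A @ Q + 0.01 * rng.standard_normal((d, k))   # "model 2" directions

    # Orthogonal Procrustes: orthogonal R minimizing ||A @ R - B||_F.
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt

    before = np.linalg.norm(A - B) / np.linalg.norm(B)
    after = np.linalg.norm(A @ R - B) / np.linalg.norm(B)
    print(f"misalignment before rotation: {before:.3f}, after: {after:.3f}")

If two ViTs' detector subspaces really do match up to a rotation, this kind of alignment would reveal it; the interpretability question above is then what the aligned directions actually encode.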

TheOtherHobbes · a day ago:
It hints there may be common higher-level abstraction and compression processes in human consciousness. The "human" part of that matters. This is all human-made data, collected from human technology, which was created to assist human thinking and experience. So I wonder if this isn't so much about universals or Platonic ideals. More that we're starting to see the outlines of the shapes that define - perhaps constrict - our own minds.

crooked-v · 2 days ago:
Now I wonder how much this "Universal Subspace" corresponds to the same set of scraped Reddit posts and pirated books that apparently all the bigcorps used for model training. Is it 'universal' because it's universal, or because the same book-pirating torrents got reused all over?