godelski 3 days ago

I think there are two subtle but key concepts you're missing:

  1) "pertaining"
  2) architecture
1) Yes, they're trained on different data, but "tune" implies that most of the training data (the pretraining) is shared. So it should be surprising if the models end up significantly different.

2) The architecture and training methods matter. As a simple scenario to make things easier to understand, let's say we have two models with identical architectures, and we'll use identical training methods (e.g. optimizer, learning rate, all that jazz) but learn from different data. And so you can reproduce this on your own, let's train one on MNIST (numbers) and the other on FashionMNIST (clothing). A minimal sketch follows.
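Something like this in PyTorch would do it (the small CNN and hyperparameters here are just illustrative choices, not anything canonical):

  # Two identical CNNs, identical training setup, different data.
  import torch, torch.nn as nn
  from torch.utils.data import DataLoader
  from torchvision import datasets, transforms

  def small_cnn():
      return nn.Sequential(
          nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
          nn.MaxPool2d(2),
          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
          nn.MaxPool2d(2),
          nn.Flatten(),
          nn.Linear(32 * 7 * 7, 10),  # 28x28 -> 7x7 after two pools
      )

  def train(model, dataset, epochs=5, seed=0):
      torch.manual_seed(seed)  # same init and data order for both runs
      loader = DataLoader(dataset, batch_size=128, shuffle=True)
      opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # identical optimizer/lr
      loss_fn = nn.CrossEntropyLoss()
      for _ in range(epochs):
          for x, y in loader:
              opt.zero_grad()
              loss_fn(model(x), y).backward()
              opt.step()
      return model

  tfm = transforms.ToTensor()
  mnist = datasets.MNIST("data", train=True, download=True, transform=tfm)
  fashion = datasets.FashionMNIST("data", train=True, download=True, transform=tfm)
  model_a = train(small_cnn(), mnist)    # numbers
  model_b = train(small_cnn(), fashion)  # clothing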

Do you expect these models to have similar latent spaces? You should! Despite the data being very different visually, there's a ton of implicit information that's shared (this is a big reason we do tuning in the first place!). One of the most obvious things you'll see is subnetworks that do edge detection (there's a famous paper showing this with convolutions, but transformers do this too, just in a somewhat different way). The more similar the data, the more similar the latent spaces will be (order shouldn't matter too much with modern training methods, but it definitely influences things). So if we trained on LAION we should expect the model to do really well on ImageNet, because even if there aren't identical images (there are some), there are the same classes (even if the labels are different)[0].
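If you want to make "similar latent spaces" concrete, one common tool is linear CKA (Kornblith et al., 2019): push the same probe images through both models and compare intermediate activations. A minimal sketch, assuming X and Y are (samples, features) activation matrices you've already extracted:

  import torch

  def linear_cka(X, Y):
      X = X - X.mean(0, keepdim=True)  # center each feature
      Y = Y - Y.mean(0, keepdim=True)
      num = (Y.T @ X).norm(p="fro") ** 2
      den = (X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro")
      return (num / den).item()  # 1.0 = identical (up to rotation/scale)

  # e.g. run the same batch through both models, flatten the activations
  # after the second conv block, then:
  # sim = linear_cka(feats_mnist_model, feats_fashion_model)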

If you think a bit here, you'll realize that some of this happens even if you change architectures, because some principles are the same. Where architecture similarity and training similarity really help is that they bias features toward being learned at the same rate and in the same places. But this idea is also why you can distill between different architectures, not just by passing the final output but even by using intermediate information.
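For reference, a rough sketch of what distilling with intermediate information can look like (a FitNets-style "hint" loss plus classic logit distillation; the projection layer, temperature, and weighting below are illustrative assumptions, not a canonical recipe):

  import torch.nn.functional as F

  def distill_loss(student_feat, teacher_feat, logits_s, logits_t,
                   proj, T=4.0, alpha=0.5):
      # match intermediate features (proj maps student dims -> teacher dims,
      # which is what lets the two architectures differ)
      hint = F.mse_loss(proj(student_feat), teacher_feat)
      # match softened final outputs (classic logit distillation)
      kd = F.kl_div(
          F.log_softmax(logits_s / T, dim=-1),
          F.softmax(logits_t / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      return alpha * hint + (1 - alpha) * kd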

To help, remember that these models converge. Accuracy jumps a lot in the beginning, then slows. For example, you might get 70% accuracy in a few epochs but need a few hundred to get to 90% (example numbers). So ask yourself: "what's being learned first, and why?" A lot will make more sense if you do this.

[0] I have a whole rant about the practice of saying "zero shot" on ImageNet (or COCO) when a model was trained on things like LAION or JFT. It's not zero shot, because ImageNet is in distribution! We wouldn't say "we zero-shotted the test set" smh