kacesensitive 2 days ago
Interesting.. this could make training much faster if there's a universal low-dimensional space that models naturally converge to, since you could initialize or constrain training inside that space instead of spending massive compute rediscovering it from scratch every time.
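For concreteness, here's a minimal sketch (my own, not from the paper) of what "constrain training inside that space" could look like: an intrinsic-dimension-style reparameterization where every weight is theta_0 + P·z, with P a fixed subspace basis (random here, but ideally a previously discovered shared subspace) and only the low-dimensional z trained. The architecture, dimensions, and basis are all illustrative assumptions.

```python
# Sketch of training constrained to a low-dimensional parameter subspace:
# all parameters are reparameterized as theta = theta_0 + P @ z, where P is a
# fixed projection and only the low-dim coordinates z are learned.
import torch
import torch.nn as nn

class SubspaceNet(nn.Module):
    def __init__(self, in_dim=32, hidden=64, out_dim=10, subspace_dim=16):
        super().__init__()
        # A frozen "template" network supplies theta_0 and the parameter shapes.
        self.template = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        for p in self.template.parameters():
            p.requires_grad_(False)
        total = sum(p.numel() for p in self.template.parameters())
        # Fixed basis for the subspace (random here; a "universal" basis would
        # presumably come from previously trained models instead).
        self.register_buffer("P", torch.randn(total, subspace_dim) / subspace_dim**0.5)
        # The only trainable parameters: coordinates inside the subspace.
        self.z = nn.Parameter(torch.zeros(subspace_dim))

    def forward(self, x):
        offset = self.P @ self.z  # flat parameter offset of length `total`
        out, i = x, 0
        for module in self.template:
            if isinstance(module, nn.Linear):
                n_w = module.weight.numel()
                w = module.weight + offset[i:i + n_w].view_as(module.weight)
                i += n_w
                n_b = module.bias.numel()
                b = module.bias + offset[i:i + n_b].view_as(module.bias)
                i += n_b
                out = torch.nn.functional.linear(out, w, b)
            else:
                out = module(out)
        return out

net = SubspaceNet()
x = torch.randn(8, 32)
loss = net(x).pow(2).mean()   # dummy loss just to show the gradient path
loss.backward()               # gradients flow only into net.z (16 numbers)
print(net.z.grad.shape)       # torch.Size([16])
```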
tsurba 2 days ago
You can show, for example, that siamese encoders for time series, trained with an MSE loss on similarity and without a decoder, will converge to the same latent space up to orthogonal transformations (the MSE is roughly a Gaussian prior, which doesn't distinguish between different rotations).

Similarly, I would expect transformers trained on the same next-word-prediction loss, if the data is at all similar (like human language), to converge to approximately the same space. And to represent that same space, the weights are probably similar too. Weights in general seem to occupy low-dimensional spaces.

All in all, I don't think this is that surprising, and I think the theoretical angle should be (or should have been?) to find mathematical proofs, like this paper: https://openreview.net/forum?id=ONfWFluZBI
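A rough sketch of how one might test the "same up to an orthogonal transform" claim empirically (not taken from the linked paper): given embeddings A and B of the same inputs from two independently trained encoders, the best orthogonal alignment is the classic Procrustes solution, and the achieved normalized inner product is the nuclear norm of AᵀB. The toy encoders and data below are placeholders.

```python
# Procrustes similarity: how well can one embedding be rotated onto another?
import numpy as np

def procrustes_similarity(A: np.ndarray, B: np.ndarray) -> float:
    """A, B: (n_samples, dim) embeddings of the same inputs; returns a score in [0, 1]."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    # max over orthogonal R of <A R, B> equals the sum of singular values of A^T B
    return float(np.linalg.svd(A.T @ B, compute_uv=False).sum())

# Toy check with synthetic "embeddings": B is a rotated, slightly noisy copy of A.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 32))
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # random orthogonal matrix
B = A @ Q + 0.01 * rng.normal(size=A.shape)
print(procrustes_similarity(A, B))                         # close to 1
print(procrustes_similarity(A, rng.normal(size=A.shape)))  # much smaller for unrelated spaces
```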
moelf 2 days ago
> instead of spending massive compute rediscovering it from scratch every time

It's interesting that this paper came out of JHU, not one of the groups at OAI/Google/Apple, considering that the latter have probably spent 1000x more resources on that "rediscovering".
bigbuppo 2 days ago
Wouldn't this also mean that there's an inherent limit to that sort of model? | ||||||||||||||||||||||||||||||||||||||
odyssey7 2 days ago
Or an architecture could be chosen for that subspace, or for some of its properties, as inductive biases.