farhanhubble 2 days ago

Would you see a lower-rank subspace if the learned weights were just random vectors?

imtringued 9 hours ago

This is a good point, but I think this only works for D*A, where D = Σ is a diagonal matrix with learnable parameters. It probably doesn't work for a full singular value decomposition (SVD), U*D*V^T.

Basically, what if we're not actually "training" the model, but rather the model was randomly initialized and the learning algorithm is just selecting the vectors that happen to point in the right direction? A left multiplication of the form D*A with a diagonal matrix D is equivalent to multiplying each row of A by the corresponding diagonal element. A low value means the vector in question was a lottery blank and unnecessary; a high value means it turned out to be a correct vector, yay!
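A minimal numpy sketch of that claim (the names and values here are illustrative, not from the thread): left-multiplying by a diagonal matrix just rescales rows, so a near-zero diagonal entry effectively prunes its row.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))     # 4 random "lottery ticket" row vectors
    d = np.array([0.0, 0.9, 0.1, 1.5])  # hypothetical learned "selection" weights
    D = np.diag(d)

    # Row i of D @ A is d[i] * A[i]: d[i] ~ 0 discards a blank,
    # a large d[i] keeps a winning ticket.
    assert np.allclose(D @ A, d[:, None] * A)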

But this trivial explanation doesn't work for the full SVD, because you now also have a right multiplication, U*D. There, each column of U gets multiplied by the corresponding diagonal element. For the "selection" theory to work, both the column of U and the matching row of V^T would have to coincide perfectly with the right directions, which is unlikely to be true for small models, which nonetheless happen to work just fine.
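For contrast, a toy sketch of the full-SVD case (again illustrative numpy, not from the thread): right-multiplying by D scales the columns of U, and U*D*V^T splits into rank-1 terms that each pair a column of U with a row of V^T, so "selecting" component j only helps if both halves of the pair point the right way at once.

    import numpy as np

    rng = np.random.default_rng(1)
    U = rng.standard_normal((5, 3))
    Vt = rng.standard_normal((3, 6))
    d = np.array([2.0, 0.5, 0.0])   # hypothetical singular values
    D = np.diag(d)

    # Right multiplication scales each *column* of U by d[j].
    assert np.allclose(U @ D, U * d[None, :])

    # U @ D @ Vt is a sum of rank-1 terms d[j] * outer(U[:, j], Vt[j]);
    # selecting component j keeps the column U[:, j] and the row Vt[j]
    # together as a pair, so both would have to be "correct" vectors.
    W = sum(d[j] * np.outer(U[:, j], Vt[j]) for j in range(3))
    assert np.allclose(W, U @ D @ Vt)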