farhanhubble 2 days ago

It might be worth using that subset to initialize the weights of future models, but more importantly, you could save a huge number of compute cycles by using the lower-dimensional weights at inference time.

westoncb 2 days ago | parent [-]

Ah interesting, I missed that possibility. Digging a little more, though, my understanding is that what's universal is a shared basis in weight space, and particular models of the same architecture can express their specific weights via coefficients in a lower-dimensional subspace using that universal basis (so we get weight compression and a simplified parameter search). But it also sounds like the extent of any gains during inference is up in the air?

Key point being: the parameters might be picked off a lower-dimensional manifold (in weight space), but this doesn't imply that lower-rank operators on activation space will be found. So the translation to inference time isn't clear.
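
To make the distinction concrete, here's a minimal numpy sketch of my reading (the basis, sizes, and coefficients are all made up for illustration): a weight matrix can live in a tiny subspace of weight space, i.e. be fully specified by just k coefficients over a shared basis, while still being full rank as an operator on activations, so the matvec at inference costs the same as any dense matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # layer size, subspace dimension (k << d*d); hypothetical values

# A "universal" basis of k dense matrices in weight space (d*d dims each).
basis = rng.standard_normal((k, d, d))

# A specific model's weights: one point in the k-dim subspace,
# W = sum_i c_i * B_i.
coeffs = rng.standard_normal(k)
W = np.tensordot(coeffs, basis, axes=1)

# W is described by only k numbers (given the shared basis)...
print("params needed to specify W given basis:", k)

# ...but as an operator it is generically still full rank, so the
# matvec W @ h is as expensive as any dense d x d multiply.
print("rank of W:", np.linalg.matrix_rank(W))  # typically d
```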

farhanhubble 2 days ago | parent [-]

My understanding differs and I might be wrong. Here's what I inferred:

Let's say you fine-tune a Mistral-7B. There are hundreds of other fine-tuned Mistral-7Bs, which means it's feasible to find the universal subspace U of the weights of all these models combined. You can then decompose the weights of your specific model using U and a coefficient matrix C specific to your model, and convert any operation of the form `out = W h` to `out = U (C h)`. Both U and C are of much smaller dimension than W, so the number of matrix operations as well as the memory required is drastically lower.
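
A minimal numpy sketch of that reading. The layer sizes, the rank k, and the assumption that W factors exactly as U @ C are all hypothetical; the speed-up rests on k being much smaller than the layer dimensions, and on the factorization actually holding, which is the part the parent comment questions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 4096, 4096, 256  # hypothetical layer sizes, k << d

# Assumed factorization: W ~= U @ C, with U a shared basis and
# C the model-specific coefficient matrix.
U = rng.standard_normal((d_out, k)) / np.sqrt(k)
C = rng.standard_normal((k, d_in)) / np.sqrt(d_in)
W = U @ C
h = rng.standard_normal(d_in)

dense = W @ h           # ~ d_out * d_in multiply-adds
factored = U @ (C @ h)  # ~ k * (d_in + d_out) multiply-adds

assert np.allclose(dense, factored)
print("dense FLOPs:   ", 2 * d_out * d_in)        # ~33.6M
print("factored FLOPs:", 2 * k * (d_in + d_out))  # ~4.2M
```

Note the parenthesization matters: computing `(U @ C) @ h` just rebuilds the dense W, while `U @ (C @ h)` never materializes it.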