westoncb 2 days ago
Ah interesting, I missed that possibility. Digging a little more, though, my understanding is that what's universal is a shared basis in weight space: particular models of the same architecture can express their specific weights via coefficients in a lower-dimensional subspace spanned by that universal basis (so we get weight compression and a simplified parameter search). But it also sounds like the extent of any gains during inference is still up in the air. The key point: the parameters might be picked off a lower-dimensional manifold in weight space, but that doesn't imply lower-rank activation-space operators will be found, so the translation to inference time isn't clear.
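To make that distinction concrete, here's a toy numpy sketch of my reading (all dimensions made up): if each basis element is itself a full weight matrix, a specific model compresses down to a handful of coefficients, yet the reconstructed operator is still dense.

```python
import numpy as np

d, k = 512, 16
rng = np.random.default_rng(0)

# Hypothetical universal basis: k full d x d matrices shared across models.
basis = rng.standard_normal((k, d, d))
# A specific model is just k coefficients -- that's the weight compression.
c = rng.standard_normal(k)

# Reconstruct this model's weight matrix from the shared basis.
W = np.einsum('i,ijk->jk', c, basis)

# W is generically dense and full-rank, so the inference-time matvec W @ h
# still costs O(d^2): a low-dimensional manifold in weight space does not
# by itself yield a low-rank activation-space operator.
print(np.linalg.matrix_rank(W))  # typically d, not k
```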
farhanhubble 2 days ago
My understanding differs, and I might be wrong. Here's what I inferred: say you fine-tune a Mistral-7B. There are hundreds of other fine-tuned Mistral-7Bs, which makes it feasible to find the universal subspace U of the weights of all these models combined. You can then decompose the weights of your specific model using U and a coefficient matrix C specific to your model, i.e. W ≈ UC. Any operation of the form `out = Wh` then becomes `out = U(Ch)`. Both U and C are of much smaller dimension than W, so the number of matrix operations as well as the memory required is drastically lower.
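A toy numpy sketch of that reading (dimensions made up, and this is my interpretation rather than anything from the paper): if W really factors as U @ C, reordering the matvec avoids ever materializing W.

```python
import numpy as np

d, k = 1024, 32
rng = np.random.default_rng(0)

U = rng.standard_normal((d, k))  # shared basis, tall and skinny
C = rng.standard_normal((k, d))  # coefficients specific to this model
h = rng.standard_normal(d)       # activation vector

out_full = (U @ C) @ h  # materializes the d x d weight matrix: O(d^2)
out_fact = U @ (C @ h)  # two skinny matvecs: O(d*k) FLOPs and memory

assert np.allclose(out_full, out_fact)
# Storage drops from d*d floats to 2*d*k, and each matvec from ~d^2 to
# ~2*d*k multiply-adds -- this is the reading under which the savings
# do carry over to inference time.
```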