| ▲ | farhanhubble 2 days ago | |||||||
It might we worth it to use that subset to initialize the weights of future models but more importantly you could save a huge number of computational cycles by using the lower dimensional weights at the time of inference. | ||||||||
| ▲ | westoncb 2 days ago | parent [-] | |||||||
Ah interesting, I missed that possibility. Digging a little more though my understanding is that what's universal is a shared basis in weight space, and particular models of same architecture can express their specific weights via coefficients in a lower-dimensional subspace using that universal basis (so we get weight compression, simplified param search). But it also sounds like to what extent there will be gains during inference is in the air? Key point being: the parameters might be picked off a lower dimensional manifold (in weight space), but this doesn't imply that lower-rank activation space operators will be found. So translation to inference-time isn't clear. | ||||||||
| ||||||||