markisus 2 days ago

Each fine tune drags the model weights away from the base model in a certain direction.

Given 500 fine-tune datasets, we could expect the 500 drag directions to span a 500-dimensional space. After all, 500 random vectors in a high-dimensional space are almost certainly linearly independent and very nearly mutually orthogonal.
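
For intuition, here is a quick numerical check of that near-orthogonality claim, with random Gaussian vectors standing in for the drag directions (the dimension here is tiny compared to a real model's parameter count):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10_000                            # stand-in for a model's parameter count
    vs = rng.standard_normal((500, d))    # 500 random "drag directions"
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)

    cos = vs @ vs.T                       # pairwise cosine similarities
    off_diag = np.abs(cos[~np.eye(500, dtype=bool)])
    print(off_diag.max())                 # small (around 0.04 here): nearly mutually orthogonal
    print(np.linalg.matrix_rank(vs))      # 500: together they span a 500-dimensional subspace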

The paper shows, however, that the 500 drag directions live in a ~40-dimensional subspace.

Another way to say it is that you can compress a fine-tune's weight update into a vector of about 40 floats.
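
To make the compression concrete, here is a minimal sketch with synthetic deltas constructed to lie near a 40-dimensional subspace; the shared basis is fit with an SVD across all fine-tunes, which may or may not match the paper's exact procedure:

    import numpy as np

    N, P, K = 500, 10_000, 40             # fine-tunes, parameters (toy-sized), subspace dim
    rng = np.random.default_rng(1)

    # Pretend each row is (fine_tuned_weights - base_weights), flattened.
    # These are synthetic: built to lie near a K-dimensional subspace plus noise.
    hidden_basis = rng.standard_normal((K, P))
    deltas = rng.standard_normal((N, K)) @ hidden_basis + 0.01 * rng.standard_normal((N, P))

    # Fit one shared K-dimensional basis from the stacked deltas via SVD.
    _, _, Vt = np.linalg.svd(deltas, full_matrices=False)
    basis = Vt[:K]                        # K x P: stored once, K full models' worth of storage

    # Each fine-tune is now just K coefficients (~40 floats).
    coeffs = deltas @ basis.T             # N x K
    recon = coeffs @ basis                # reconstruct the full deltas

    print(np.linalg.norm(recon - deltas) / np.linalg.norm(deltas))  # small relative error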

Imagine if, one day, fine-tunes on Hugging Face were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes (40 floats at 4 bytes each). Would that be surprising?

I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine, and that each basis direction is as big as the model itself. I’m also taking for granted that the subspace dimension will not keep growing as the number of fine-tune datasets increases.

I agree that the authors’ decision to use random models from Hugging Face is unfortunate. I’m hopeful that this paper will inspire follow-up work that trains large models from scratch.

mapontosevenths 2 days ago

Agreed. What's surprising to me here isn't that the fine-tunes are compressible; it's the degree to which they're compressible. It seems like very little useful new information is being added by the fine-tune.

They're using SVD to throw away almost all of the "new information" and apparently getting solid results anyhow. Which of course raises interesting questions if replicable. The code doesn't seem to have been released yet though.
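
As a rough, self-contained illustration of that framing (not the paper's code): the fraction of a weight delta's energy kept by a rank-k SVD truncation is easy to measure, and with a fast-decaying spectrum a handful of directions carry almost everything. The decay rate below is an assumption, not measured from any real fine-tune.

    import numpy as np

    def retained_energy(delta, k):
        # Fraction of the squared Frobenius norm kept by a rank-k truncation.
        s = np.linalg.svd(delta, compute_uv=False)
        return (s[:k] ** 2).sum() / (s ** 2).sum()

    rng = np.random.default_rng(2)
    U, _ = np.linalg.qr(rng.standard_normal((512, 512)))
    V, _ = np.linalg.qr(rng.standard_normal((512, 512)))
    s = np.exp(-np.arange(512) / 10.0)     # assumed fast spectral decay
    delta = (U * s) @ V.T                  # toy stand-in for one layer's weight update

    print(retained_energy(delta, 40))      # ~0.9997: nearly all the energy in 40 directions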

farhanhubble 2 days ago

Yeah, but it also made me wonder whether, deep down, neural networks are just curated sets of random basis vectors, as in random projections.
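
For what it's worth, the random-projection effect being alluded to is easy to demonstrate: a fixed random basis, with no training at all, already preserves pairwise structure fairly well (the Johnson-Lindenstrauss flavor of result). Whether trained networks really amount to a "curated" version of this is, of course, just the speculation above.

    import numpy as np

    def pdist(A):
        # Pairwise Euclidean distances between the rows of A.
        sq = (A ** 2).sum(axis=1)
        return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

    rng = np.random.default_rng(3)
    n, d, k = 200, 10_000, 400
    X = rng.standard_normal((n, d))               # arbitrary high-dimensional points
    R = rng.standard_normal((d, k)) / np.sqrt(k)  # fixed random projection, no learning
    Y = X @ R

    mask = ~np.eye(n, dtype=bool)
    ratio = pdist(Y)[mask] / pdist(X)[mask]
    print(ratio.min(), ratio.max())               # roughly within +/-15% of 1.0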