Dispersion loss counteracts embedding condensation in small language models

aetherspawn 39 minutes ago | parent | next [-]

It makes sense to me that distributing information across more parameters results in models that can be quantized more heavily (information theory: more bits available).

I wonder if anyone has figured out how the information is compressed, and calculated how much information an LLM can hold depending on its size.
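
To put rough numbers on the "more bits available" intuition, here is a minimal sketch of the raw bit budget at a few sizes and quantization levels. It is only an upper bound on storage (parameter count times bits per weight), not a measure of how much information a model actually learns; the function name, model sizes, and bit widths are illustrative assumptions, not figures from the paper.

```python
def raw_bit_budget(n_params: int, bits_per_weight: int) -> int:
    """Upper bound on raw storage: parameter count times bits per weight.

    This is only the size of the container, not the amount of
    information the trained model actually holds.
    """
    return n_params * bits_per_weight


# Hypothetical model sizes and bit widths, chosen purely for illustration.
for label, n_params in [("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
    for bits in (16, 8, 4):
        gib = raw_bit_budget(n_params, bits) / 8 / 2**30  # bits -> bytes -> GiB
        print(f"{label} params @ {bits}-bit: {gib:.2f} GiB raw budget")
```

By this crude measure a 7B model at 4-bit has a larger raw budget than a 1B model at 16-bit, which matches the intuition that a bigger model has more room to absorb heavy quantization.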

lwansbrough 41 minutes ago | parent | prev [-]

Anyone with a billion dollars want to try this and report back?

	nullc 26 minutes ago | parent [-]
		From the paper it appears that it's probably more useful on small-ish models.