londons_explore 3 hours ago

How is the research on training these models directly in their quantized state going?

That'll be the real game changer.

sigmoid10 2 hours ago | parent | next

The original BitNet was natively trained at 1.58 bits. PrismML has not released any actual info on how they trained, but since their models are based on Qwen, there was certainly some downstream quantization involved.
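Downstream ternary quantization is commonly done with an absmean scheme like the one described for BitNet b1.58; whether PrismML uses exactly this is an assumption, but the mechanics are roughly:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    # Absmean quantization in the style of BitNet b1.58 (assumed, not
    # confirmed for PrismML): one per-tensor scale, weights rounded
    # and clipped to {-1, 0, +1}.
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

w = np.array([[0.4, -0.05, -0.9],
              [0.02, 0.7, -0.3]])
q, s = ternary_quantize(w)
# the dequantized approximation of w is simply q * s
```

Note this happens entirely after full-precision training, which is why it counts as "downstream" rather than native 1.58-bit training.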

usrusr 40 minutes ago | parent

Is it just quantization, or is it also rearranging the weights to get clusters with (almost) the same factors? If it's the latter, it would very much be training in full precision (but also with hardly any precision lost in the compression).

Unfortunately my mental model doesn't contain anything to even guess whether that's possible; my AI days were on the falling flank of the symbolic era. Funny how one-bit models feel a bit like approaching an approximation of symbolic AI again (until you read about the grouped scale factors, and then the illusion is gone).

One thought that suggests rearranging is not involved, a thought that requires no knowledge at all: if it did involve rearranging, someone would certainly have added order-by-scale-factor tricks with linear interpolation by address offset to lose even less precision.
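For what it's worth, grouped scale factors as typically implemented work by address order, not by rearranging: the weight vector is cut into fixed-size consecutive groups, each with its own scale. A sketch under that assumption (group size and bit width here are illustrative, not any particular format's):

```python
import numpy as np

def groupwise_quantize(w, group_size=4, bits=4):
    # One scale per consecutive group of weights, groups taken in
    # address order with no rearranging. Real formats differ in group
    # size, bit width, and scale encoding; this is only a sketch.
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed
    g = w.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                 # avoid divide-by-zero
    q = np.round(g / scales * levels).astype(np.int8)
    return q, scales

def groupwise_dequantize(q, scales, bits=4):
    levels = 2 ** (bits - 1) - 1
    return q / levels * scales

w = np.linspace(-1.0, 1.0, 16)
q, s = groupwise_quantize(w)
err = np.abs(groupwise_dequantize(q, s).ravel() - w).max()
```

Since the groups are positional, there's nothing to sort or interpolate over, which fits the observation above that rearranging tricks are conspicuously absent.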

cubefox 44 minutes ago | parent | prev

This is the only paper which really does this:

https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...

They train directly in the 1-bit domain, without any floating-point weights. They don't use the classical Newton-Leibniz derivative (which operates on approximations of real numbers) for gradient descent / backpropagation. Instead they invented a binary analogue called "Boolean variation".
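For flavor only: here's a toy of updating {-1, +1} weights by counting boolean agreements, with no floating-point gradients anywhere. This is NOT the paper's Boolean-variation rule, just an illustration of what "learning in the 1-bit domain" can look like:

```python
import numpy as np

# Toy example (not the paper's algorithm): recover a hidden sign
# pattern with weights that live in {-1, +1}, updated by a majority
# vote over boolean agreement counts instead of real-valued gradients.
rng = np.random.default_rng(0)
n, samples = 32, 200

target = rng.choice([-1, 1], size=n)            # hidden sign pattern
x = rng.choice([-1, 1], size=(samples, n))      # binary inputs
noise = rng.choice([-1, 1], size=(samples, n), p=[0.1, 0.9])
y = x * target * noise                          # labels with 10% bit flips

votes = (x * y).sum(axis=0)                     # agreement count per weight
w = np.where(votes >= 0, 1, -1)                 # majority vote -> {-1, +1}
# with this seed, w recovers `target` exactly despite the label noise
```

The entire "training" step is integer counting and sign comparisons, which is the kind of thing that makes native low-bit training attractive for hardware.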

I don't know why this paper didn't get more attention.