▲ | Havoc 2 days ago |
Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B, but that's about it. Is there a technical reason for it, or just research convenience?
▲ | londons_explore 2 days ago | parent | next [-]
I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing 8 or 16 bits for the activations in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would have with FP32. Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.
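To make that concrete, here's a minimal PyTorch sketch of BitNet-style ternary "fake quantization" training (the layer and names are illustrative, not any particular implementation). Even though the weights only take values in {-1, 0, +1} times a scale, the actual GEMM still runs in the activation dtype, so the GPU does a normal dense bf16 matmul and gets no speedup from the 1.58-bit weights:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TernaryLinear(nn.Linear):
        """Linear layer with weights fake-quantized to {-1, 0, +1} * scale.

        The matmul itself still runs in the activation dtype (bf16/fp16),
        so current GPUs get no extra throughput from the ternary weights.
        """
        def forward(self, x):
            w = self.weight
            scale = w.abs().mean().clamp(min=1e-5)           # per-tensor absmean scale
            w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary values, rescaled
            # Straight-through estimator: use quantized weights in the forward
            # pass, but let gradients flow to the full-precision latent weights.
            w_ste = w + (w_q - w).detach()
            return F.linear(x, w_ste, self.bias)

    # The GEMM below is still an ordinary dense bf16 matmul.
    layer = TernaryLinear(4096, 4096, bias=False).bfloat16()
    y = layer(torch.randn(8, 4096, dtype=torch.bfloat16))

So during training you pay full-precision memory and compute for the latent weights anyway; the savings only show up at inference, and only with kernels (or silicon) built for ternary weights.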
▲ | yieldcrv 2 days ago | parent | prev [-]
They aren't; there's a 1.58-bit version of DeepSeek that's roughly 200 GB instead of 700 GB.