laidoffamazon 3 days ago
Isn't the new trend to train in lower precision anyway?
neilmovva 3 days ago | parent
Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes, this works, especially if we're talking about MXFP8 as supported on Blackwell [0].

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on the 5090.

[0]: https://arxiv.org/abs/2506.08027

[1]: https://arxiv.org/abs/2502.20586
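If it helps to see where the precision actually enters, here is a minimal numerics sketch in PyTorch (my own illustration, not the Blackwell/MXFP8 hardware path and not the recipes from [0] or [1]): it simulates "FP8 x FP8 -> FP32" by quantizing each 32-element block to float8_e4m3fn with a shared per-block scale, then doing the accumulation in FP32. The block size of 32 and the E4M3 max of 448 are MX-style assumptions; torch.float8_e4m3fn needs PyTorch >= 2.1, and a real kernel would go through the hardware matmul path rather than upcasting.

    # Numerics sketch only, not a performance kernel: quantize inputs to
    # FP8 (E4M3) with one scale per 32-element block along the reduction
    # dim, then accumulate the product in FP32.
    import torch

    BLOCK = 32  # MX-style formats share one scale per 32-element block

    def quantize_fp8_blockwise(x: torch.Tensor) -> torch.Tensor:
        # x: (rows, K). Scale each 32-wide block so its max maps to the
        # E4M3 max (448), round-trip through float8, return FP32 values.
        rows, k = x.shape
        assert k % BLOCK == 0
        xb = x.reshape(rows, k // BLOCK, BLOCK)
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 448.0
        q = (xb / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)  # lossy step
        return (q.to(torch.float32) * scale).reshape(rows, k)

    def fp8_matmul_fp32_accum(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Both operands are FP8-quantized; the accumulation stays FP32.
        aq = quantize_fp8_blockwise(a)
        bq = quantize_fp8_blockwise(b.t()).t()  # quantize b along its K dim
        return aq @ bq  # FP32 matmul = FP32 accumulation

    a = torch.randn(128, 256)
    b = torch.randn(256, 64)
    err = (fp8_matmul_fp32_accum(a, b) - a @ b).abs().max().item()
    print(f"max |error| vs full-FP32 matmul: {err:.4f}")

The only lossy step is the float8 round-trip on the inputs; the matmul itself accumulates in FP32, which is the part that stays important.
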
storus 3 days ago | parent
Only GPU-poors run Q-GaLore and similar tricks.