laidoffamazon 3 days ago
Isn't the new trend to train in lower precision anyway?
neilmovva 3 days ago | parent
Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes, this works, especially if we're talking about MXFP8 as supported on Blackwell [0].

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on the 5090.

[0]: https://arxiv.org/abs/2506.08027

[1]: https://arxiv.org/abs/2502.20586
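If it helps to see where the precision actually enters, here is a minimal numerics sketch in PyTorch (my own illustration, not the Blackwell/MXFP8 hardware path and not the recipes from [0] or [1]): it simulates "FP8 x FP8 -> FP32" by quantizing each 32-element block to float8_e4m3fn with a shared per-block scale, then doing the accumulation in FP32. The block size of 32 and the E4M3 max of 448 are MX-style assumptions; torch.float8_e4m3fn needs PyTorch >= 2.1, and a real kernel would go through the hardware matmul path rather than upcasting.

    # Numerics sketch only, not a performance kernel: quantize inputs to
    # FP8 (E4M3) with one scale per 32-element block along the reduction
    # dim, then accumulate the product in FP32.
    import torch

    BLOCK = 32  # MX-style formats share one scale per 32-element block

    def quantize_fp8_blockwise(x: torch.Tensor) -> torch.Tensor:
        # x: (rows, K). Scale each 32-wide block so its max maps to the
        # E4M3 max (448), round-trip through float8, return FP32 values.
        rows, k = x.shape
        assert k % BLOCK == 0
        xb = x.reshape(rows, k // BLOCK, BLOCK)
        scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 448.0
        q = (xb / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)  # lossy step
        return (q.to(torch.float32) * scale).reshape(rows, k)

    def fp8_matmul_fp32_accum(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Both operands are FP8-quantized; the accumulation stays FP32.
        aq = quantize_fp8_blockwise(a)
        bq = quantize_fp8_blockwise(b.t()).t()  # quantize b along its K dim
        return aq @ bq  # FP32 matmul = FP32 accumulation

    a = torch.randn(128, 256)
    b = torch.randn(256, 64)
    err = (fp8_matmul_fp32_accum(a, b) - a @ b).abs().max().item()
    print(f"max |error| vs full-FP32 matmul: {err:.4f}")

The only lossy step is the float8 round-trip on the inputs; the matmul itself accumulates in FP32, which is the part that stays important.
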
storus 3 days ago | parent
Only GPU-poors run Q-GaLore and similar tricks.