kouteiheika 4 days ago
The training memory breakdown is wildly inaccurate:

- No one trains big models in FP32 anymore.

- Gradients can also often be in BF16, and they don't have to be stored at all if you're not using gradient accumulation, or if you're accumulating them directly in the optimizer's state.

- 32-bit Adam is silly; if you don't have infinite VRAM there's no reason not to use 8-bit Adam (or you can go even lower with quantized Muon).

- Activations take up memory too, but aren't mentioned at all.

It says that to train a 3.77B-parameter model I need 62GB of VRAM. To give some perspective on how much of an overestimate that is: a few weeks back I was full fine-tuning (not LoRA) a 14B-parameter model on 24GB of VRAM, using every trick in the book to lower VRAM usage. To be fair, not all of those tricks are available in publicly available training harnesses, but the point still stands: even with an off-the-shelf training harness you can do a lot better than this calculator suggests.
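For a sense of scale, here's a rough model-state-only sketch in Python. The per-component byte counts (including ~2 B/param for 8-bit Adam's two moments) are illustrative assumptions, not the calculator's actual formula, and activation memory is deliberately left out:

    # Back-of-the-envelope estimate of model-state VRAM (weights + grads + optimizer),
    # ignoring activations. The bytes-per-parameter figures are assumptions for illustration.

    def model_state_gib(params: float, weight_b: float, grad_b: float, optim_b: float) -> float:
        """Model-state memory in GiB for a given bytes-per-parameter breakdown."""
        return params * (weight_b + grad_b + optim_b) / 1024**3

    params = 3.77e9  # the 3.77B-parameter example from the comment

    # Calculator-style assumption: FP32 weights (4 B), FP32 grads (4 B), 32-bit Adam m+v (8 B).
    print(f"fp32 + 32-bit Adam : {model_state_gib(params, 4, 4, 8):.1f} GiB")  # ~56 GiB

    # Leaner setup: BF16 weights (2 B), BF16 grads (2 B), 8-bit Adam m+v (~2 B).
    print(f"bf16 + 8-bit Adam  : {model_state_gib(params, 2, 2, 2):.1f} GiB")  # ~21 GiB

    # Drop stored grads entirely (no accumulation, or accumulate into optimizer state).
    print(f"bf16, no grads     : {model_state_gib(params, 2, 0, 2):.1f} GiB")  # ~14 GiB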
ethan_smith 4 days ago | parent
Great points about training optimizations. For inference, similarly dramatic memory reductions are possible through quantization (INT4/INT8), which can cut VRAM needs by 2-8x compared to FP16 and lets much larger models run on consumer GPUs.
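A minimal sketch of what that looks like in practice, assuming Hugging Face transformers with bitsandbytes installed (the model id is just a placeholder; actual savings depend on the model and quantization scheme):

    # Load a causal LM with 4-bit (NF4) quantized weights via transformers + bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; any causal LM repo id works

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights as 4-bit
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in BF16
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place layers across available GPUs, offload if needed
    )

    inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))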
fooker 4 days ago | parent
Fine-tuning and training are very different beasts.