kouteiheika 4 days ago

The training memory breakdown is wildly inaccurate.

- No one trains big models in FP32 anymore.

- Gradients can also often be in BF16, and they don't actually have to be stored if you're not using gradient accumulation or if you're accumulating them directly in the optimizer's state.

- 32-bit Adam is silly; if you don't have infinite VRAM there's no reason not to use 8-bit Adam (or you can go even lower with quantized Muon).

- Activations? They take up memory too, but are not mentioned.

It shows that to train a 3.77B-parameter model I need 62GB of VRAM. Just to give you some perspective on how overestimated this is: a few weeks back I was training (full fine-tuning, not LoRA) a 14B-parameter model on 24GB of VRAM, using every trick in the book to lower VRAM usage. To be fair, not all of those tricks are available in publicly available training harnesses, but the point still stands that even with an off-the-shelf training harness you can do a lot better than what this calculator suggests.
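Rough back-of-the-envelope of where a number like 62GB probably comes from versus a leaner setup (the byte counts per parameter are illustrative assumptions, and activations and allocator overhead are ignored on both sides):

    params = 3.77e9  # the calculator's example model size
    GiB = 1024**3

    # Classic "16 bytes/param" recipe the calculator seems to assume:
    # FP32 weights (4) + FP32 grads (4) + FP32 Adam m and v (8)
    naive = params * (4 + 4 + 8) / GiB

    # Leaner recipe: BF16 weights (2) + BF16 grads (2)
    # + 8-bit Adam state (~2 for both moments, ignoring block-wise scales)
    lean = params * (2 + 2 + 2) / GiB

    print(f"naive ~{naive:.0f} GiB, lean ~{lean:.0f} GiB")  # ~56 vs ~21 GiB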

ethan_smith 4 days ago | parent | next [-]

Great points about training optimizations. For inference, similarly dramatic memory reductions are possible through quantization (INT8/INT4), which cuts weight memory by roughly 2-4x compared to FP16 and lets much larger models fit on consumer GPUs.
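A minimal sketch of what that looks like in practice, assuming the Hugging Face transformers + bitsandbytes stack (the model id and dtype choices here are just examples):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # A 7B model is ~14 GB of weights in FP16, ~7 GB in INT8, ~3.5-4 GB in 4-bit
    # (plus KV cache and activation overhead on top of the weights).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",   # example model id
        quantization_config=bnb_config,
        device_map="auto",
    )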

fooker 4 days ago | parent | prev [-]

Fine-tuning and training are very different beasts.

kouteiheika 4 days ago | parent [-]

No they're not? The process is essentially exactly the same, just with a much lower total FLOPs budget, since if you're not training from scratch you don't need to train for as long. I can use *exactly* the same code that I used to fine-tune a model to train a new model from scratch; literally the only differences are whether I initialize the weights randomly or from an existing model, a couple of hyperparameters (e.g. for training from scratch you want to start at a higher LR), and how long I train.
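To make that concrete, here's a toy sketch (not any particular harness's API; the stand-in model and the data iterator are hypothetical) showing that the loop itself doesn't change, only the init and the knobs:

    import torch
    import torch.nn as nn

    def build_model(init_from=None, vocab=32000, dim=512):
        # Stand-in architecture; in practice this is your transformer.
        model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
        if init_from is not None:
            # Fine-tuning: start from an existing checkpoint.
            model.load_state_dict(torch.load(init_from))
        # Pretraining: keep the random initialization.
        return model

    def train(model, batches, lr, steps):
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _, (x, y) in zip(range(steps), batches):
            logits = model(x)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            loss.backward()
            opt.step()
            opt.zero_grad()

    # Same loop, different knobs:
    #   train(build_model("base.ckpt"), batches, lr=1e-5, steps=2_000)      # fine-tune
    #   train(build_model(),            batches, lr=3e-4, steps=1_000_000)  # from scratch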

fooker 4 days ago | parent [-]

No, if you try to train an LLM like you're suggesting:

- you'll get something similar to GPT-2.

- To approach the scale of modern LLMs, you'll need about 10x more than all the GPUs in the world.

It's a neat abstraction to consider these the same, but do you think Meta is paying $100M for writing a 15-line script?

kouteiheika 4 days ago | parent [-]

I still don't understand what exactly you are disagreeing with.

Meta is paying the big bucks because to train a big LLM in a reasonable time you need *scale*. But the process itself is the same as full fine-tuning, just scaled up across many GPUs. If I were patient enough to wait a few years/decades for my single GPU to chug through 15 trillion tokens, then I too could train a Llama from scratch (assuming I fed it the same training data).
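Rough arithmetic behind the "years/decades" part, with assumed numbers (the ~6*N*D training-FLOPs rule of thumb, a Llama-3-8B-sized model, and ~400 TFLOP/s sustained on one modern GPU):

    params = 8e9                        # assume a Llama-3-8B-sized model
    tokens = 15e12                      # ~15T pretraining tokens
    flops_needed = 6 * params * tokens  # standard ~6*N*D training-FLOPs estimate

    sustained = 4e14                    # assumed ~400 TFLOP/s sustained BF16 on one GPU
    seconds = flops_needed / sustained
    print(f"~{seconds / 3.15e7:.0f} years on a single GPU")  # roughly 57 years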

fooker 3 days ago | parent [-]

> you need scale.

No, training state-of-the-art LLMs is still a bit of alchemy.

We don't understand what works and what doesn't. Meta is paying $100M each to hire AI researchers not because they know how to scale (they aren't bringing GPUs, lol), but mainly because they remember what worked and what didn't when training GPT-4.

> If I would be patient..

No, you'd spend the time and resources training and end up with something worse than even GPT-3.

This is what made DeepSeek appear in headlines for two months straight. Plenty of other companies have 100x more resources and are actively trying to build their own LLMs, including big names like Apple and Oracle. They haven't managed to.