gpjt 2 days ago

OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:

  * OpenAI medium weights: 3.231
  * OpenAI small weights: 3.500
  * My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
  * My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
  * My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
  * My cloud trained model, FineWeb Chinchilla, batch size 13 * 8 = 104: 3.674

That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.

I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
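
For reference, the arithmetic behind using gradient accumulation to approximate the cloud run's effective batch size on a single GPU can be sketched as follows. The helper name `accumulation_steps_for` is hypothetical, and the numbers are just the ones from the list above (micro-batch 6 locally, effective batch 104 on the 8x A100 run).

```python
import math

def accumulation_steps_for(target_batch, micro_batch, num_gpus=1):
    """Hypothetical helper: micro-batches to accumulate per optimizer step.

    Rounds up, so the effective batch may slightly exceed the target.
    """
    samples_per_pass = micro_batch * num_gpus
    return math.ceil(target_batch / samples_per_pass)

# Local run: micro-batch of 6 on one GPU, aiming for the cloud run's 13 * 8 = 104.
steps = accumulation_steps_for(target_batch=104, micro_batch=6)
effective_batch = steps * 6
print(steps, effective_batch)  # 18 108
```

Since 104 is not a multiple of 6, the closest match from above is 18 accumulation steps, for an effective batch of 108.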

spi 2 days ago | parent [-]

Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a counter for the number of steps and an `if counter % gradient_accumulation_steps == 0:` around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8xA100 and your pipeline parallelizes well, you also get results (almost) 8 times faster, which is certainly nicer for experimenting, of course!
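
A minimal sketch of that pattern, assuming a PyTorch-style training loop (this is not the code from the thread; the function name `train_with_accumulation` and its arguments are hypothetical). It only relies on `loss.backward()`, `optimizer.step()` and `optimizer.zero_grad()`, and it puts `zero_grad()` in the same `if` block as the step so gradients keep summing between steps.

```python
def train_with_accumulation(model, optimizer, loss_fn, batches, accumulation_steps):
    """Run one pass over `batches`, stepping every `accumulation_steps` micro-batches.

    Returns the number of optimizer steps taken. Micro-batches left over at the
    end (fewer than `accumulation_steps`) are accumulated but never applied.
    """
    optimizer.zero_grad()
    optimizer_steps = 0
    for counter, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the summed gradients equal the mean over the
        # larger effective batch rather than growing with accumulation_steps.
        (loss / accumulation_steps).backward()
        if counter % accumulation_steps == 0:
            optimizer.step()
            # Reset only after a step; otherwise the accumulated gradients are lost.
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps
```

With a real setup this would be called with something like a `torch.nn.Module`, a `torch.optim` optimizer and a `DataLoader`. Under DDP, the non-stepping micro-batches would typically also be wrapped in the model's `no_sync()` context manager to avoid an all-reduce on every backward pass.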

gpjt 2 days ago | parent [-]

Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.

I assume the zero_grad would need to go in the same if block?

gpjt a day ago | parent [-]

Hmm, interesting. With a batch size of 512 (8x B200s with 160 GiB each) I get worse results! Maybe there's a sweet spot somewhere in between.