andy99 5 hours ago

The Chinchilla paper says the "optimal" training dataset size is about 20x the number of parameters (in tokens); see Table 3: https://arxiv.org/pdf/2203.15556

Here they do 80B tokens for a 4B model, which is exactly that 20:1 ratio.
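Quick sanity check, not from the paper itself, using the standard C ≈ 6*N*D FLOPs approximation for training a dense transformer:

    # Back-of-the-envelope check of the ~20 tokens/parameter rule of thumb.
    # C ~ 6*N*D is the usual FLOPs approximation for dense transformer training.
    params = 4e9                   # 4B-parameter model
    tokens_per_param = 20          # Table 3 rule of thumb
    optimal_tokens = tokens_per_param * params
    train_flops = 6 * params * optimal_tokens

    print(f"optimal tokens ~ {optimal_tokens / 1e9:.0f}B")  # ~80B
    print(f"training FLOPs ~ {train_flops:.2e}")            # ~1.9e21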

EvgeniyZh 44 minutes ago

It's worth noting that this is "compute-optimal", i.e., given a fixed compute budget, 20:1 is the optimal tokens-to-parameters ratio.

Under the Chinchilla model, a larger model always performs better than a smaller one when trained on the same amount of data. I'm not sure whether that holds empirically, but 1-10B parameters is probably a good guess for how large a model trained on 80B tokens should be.

Similarly, small models continue to improve beyond the 20:1 ratio, and current models are trained on much more data than that. You could train a better-performing model with the same compute, but it would be larger, which is not always desirable.
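To make both points concrete, here's a rough sketch using the paper's parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta; the constants below are the approximate published values, so treat the exact numbers loosely:

    # Chinchilla parametric loss: L(N, D) = E + A / N^alpha + B / D^beta
    # Constants are the approximate fitted values reported in the paper;
    # the shape of the curves matters more than the exact numbers.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    # At a fixed 80B-token budget, the modeled loss keeps dropping as N grows
    # (with diminishing returns):
    for n in (1e9, 4e9, 10e9, 70e9):
        print(f"N = {n / 1e9:>4.0f}B, D = 80B -> loss ~ {loss(n, 80e9):.3f}")

    # And a fixed-size 4B model keeps improving well past the 20:1 point:
    for d in (80e9, 200e9, 1e12):
        print(f"N = 4B, D = {d / 1e9:>5.0f}B -> loss ~ {loss(4e9, d):.3f}")

Per that fit, the bigger model wins at every fixed token count, and the fixed-size model keeps gaining from more tokens, which is the tradeoff described above.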