EvgeniyZh 2 hours ago
It's worth noting that this is "compute-optimal", i.e., for a fixed compute budget, the optimal choice is roughly 20 tokens per parameter. Under the Chinchilla model, a larger model always performs better than a smaller one trained on the same amount of data. I'm not sure whether that holds empirically, but 1-10B is probably a good guess for how large a model trained on 80B tokens should be. Similarly, smaller models continue to improve beyond the 20:1 ratio, and current models are trained on much more data than that. You could train a better-performing model using the same compute, but it would be larger, which is not always desirable.
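A minimal sketch of the arithmetic behind this, assuming the usual approximation of training compute C ≈ 6·N·D FLOPs (N = parameters, D = tokens) and a fixed ~20 tokens-per-parameter ratio; the function names and the choice of 20 as a constant are my own simplification, since the fitted scaling laws vary:

```python
# Compute-optimal ("Chinchilla") rule-of-thumb calculations.
# Assumptions: C ~= 6 * N * D FLOPs and a compute-optimal ratio of ~20 tokens
# per parameter (Hoffmann et al., 2022). Treat as order-of-magnitude guides.

TOKENS_PER_PARAM = 20       # compute-optimal tokens-to-parameters ratio
FLOPS_PER_PARAM_TOKEN = 6   # forward + backward pass approximation

def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Given a FLOP budget C, return (params N, tokens D) with D = 20 * N."""
    # C = 6 * N * D = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)) ** 0.5
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

def optimal_params_for_tokens(n_tokens: float) -> float:
    """If the dataset size is fixed (e.g. 80B tokens), the compute-optimal model size."""
    return n_tokens / TOKENS_PER_PARAM

if __name__ == "__main__":
    # 80B tokens -> ~4B params, inside the 1-10B guess above.
    print(f"{optimal_params_for_tokens(80e9) / 1e9:.1f}B params for 80B tokens")

    # Roughly Chinchilla's own budget: ~70B params, ~1.4T tokens round-trips.
    n, d = compute_optimal_split(6 * 70e9 * 1.4e12)
    print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")
```

The point about training past the compute-optimal ratio is just the same trade-off read the other way: holding N fixed and growing D past 20·N keeps improving the model, but spends more compute per unit of loss than a larger compute-optimal model would.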