muricula 2 days ago
I've played with something similar on my M1 using Apple's MLX framework. The problem is that I'm compute bound: I've never managed to get my M1 Max's GPU to process more than ~7.8k tokens per second at bf16 precision, so to train a 112M-parameter model on ~20 billion tokens I'd need to run training for ~30 days.

One solution is to reduce the scope of the problem -- you can train on a smaller, less diverse dataset such as TinyStories, a collection of ~1 billion tokens of ChatGPT-generated children's stories. After about 40 hours, less than one weekend, you'll have a model that can generate mostly grammatical children's stories. If you have a newer Mac and/or an Ultra chip you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar, larger and more diverse dataset.
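To make that arithmetic concrete, here's a minimal back-of-envelope sketch; the ~7.8k tokens/sec throughput and the dataset sizes are just the figures quoted above, not a claim about any particular machine:

    # Rough training-time estimate from a fixed measured throughput.
    TOKENS_PER_SECOND = 7_800  # ~M1 Max GPU at bf16, per the figure above

    datasets = {
        "20B-token dataset (FineWeb-scale)": 20e9,
        "TinyStories (~1B tokens)": 1e9,
    }

    for name, tokens in datasets.items():
        hours = tokens / TOKENS_PER_SECOND / 3600
        print(f"{name}: ~{hours:,.0f} hours (~{hours / 24:.1f} days)")

which lands at roughly 710 hours (~30 days) for the 20B-token run and ~36 hours for TinyStories.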
gpjt 2 days ago | parent
OP here -- with a 112M-parameter model you should be able to get something worth playing with from 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you can get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy uses the 20x rule for his nanochat project. I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, with the benefit that you can ask follow-up questions.
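For anyone who wants to plug in their own numbers, here's a tiny sketch of that rule of thumb (the parameter counts below are just examples):

    # Chinchilla-style heuristic: compute-optimal training uses
    # roughly 20 tokens per model parameter.
    def chinchilla_tokens(n_params: int, ratio: int = 20) -> int:
        return ratio * n_params

    print(chinchilla_tokens(112_000_000))    # 2_240_000_000 -> 2.24B tokens
    print(chinchilla_tokens(1_000_000_000))  # 20_000_000_000 -> 20B tokens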