nico 3 days ago

Has anyone done something like this but with Apple silicon instead of a graphics card? Training a small LLM on an M2-M5?

muricula 2 days ago

I've played with something similar on my M1 using Apple's MLX framework. The problem is that I'm compute-bound: I've never managed to get my M1 Max's GPU to process more than ~7.8k tokens per second at bf16 precision, so training a 112M-parameter model on ~20 billion tokens would take ~30 days.
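That ~30-day figure is just tokens divided by throughput; a quick sanity check, using the numbers above:

```python
# Back-of-the-envelope: training time = tokens / throughput.
tokens = 20e9            # ~20 billion training tokens
tokens_per_sec = 7.8e3   # ~7.8k tokens/second on an M1 Max at bf16
days = tokens / tokens_per_sec / 86400
print(f"{days:.1f} days")  # ~29.7 days
```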

One solution is to reduce the scope of the problem -- you can train on a smaller, less diverse dataset such as TinyStories, a collection of roughly 1 billion tokens of ChatGPT-generated children's stories. After about 40 hours, less than one weekend, you'll have a model that can generate mostly grammatical children's stories.
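For concreteness, the core of an MLX training loop is quite small. Here's a minimal sketch of a single training step -- TinyModel, the dimensions, and the random batch are all placeholders for illustration, not anything from an actual TinyStories run:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

vocab_size, seq_len, batch_size = 256, 128, 32

class TinyModel(nn.Module):
    """Placeholder next-token model: embedding -> linear head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.head = nn.Linear(128, vocab_size)

    def __call__(self, x):
        return self.head(self.embed(x))

def loss_fn(model, x, y):
    # Per-token cross-entropy, averaged over batch and positions.
    return nn.losses.cross_entropy(model(x), y).mean()

model = TinyModel()
loss_and_grad = nn.value_and_grad(model, loss_fn)
optimizer = optim.AdamW(learning_rate=3e-4)

# Random token ids stand in for a real batch of TinyStories text.
x = mx.random.randint(0, vocab_size, (batch_size, seq_len))
y = mx.random.randint(0, vocab_size, (batch_size, seq_len))

loss, grads = loss_and_grad(model, x, y)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)  # force the lazy graph to run
print(loss.item())
```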

If you have a newer Mac and/or an Ultra chip, you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar larger, more diverse dataset.

gpjt 2 days ago

OP here -- with a 112M model you should be able to get something worth playing with using 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you can get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy uses the same 20x rule for his nanochat project.
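To make that concrete, here's the arithmetic as a quick sketch (the throughput figure is the one quoted upthread, not a measurement):

```python
# Chinchilla-style compute-optimal budget: tokens ~= 20 x parameters.
params = 112e6
tokens = 20 * params                  # 2.24e9 tokens
tokens_per_sec = 7.8e3                # M1 Max throughput quoted upthread
days = tokens / tokens_per_sec / 86400
print(f"{tokens:.3g} tokens, ~{days:.1f} days")  # 2.24e+09 tokens, ~3.3 days
```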

I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well too, with the added benefit that you can ask follow-up questions.

goosers 2 days ago

I’m experimenting with this, but using the CPU rather than the GPU. I’m finishing up the series now; it’s focused more on understanding the architecture than on building a useful model. Mine takes input in the language of Shakespeare and replies in kind, a proof of concept rather than a useful tool. https://www.tag1.com/white-paper/part1-tokenization-building...

I was interested in repeatability and in using text sources anyone can legally obtain. It’s been fascinating, but after much experimentation it’s clear that working with more text, and more diverse text, would be extremely helpful.