azath92 · 3 days ago
For small models this is definitely the way forward. There are some great small datasets out there; check out TinyStories, which limits vocabulary to what a young child would know while keeping the core reasoning that's inherent even in simple language (https://huggingface.co/datasets/roneneldan/TinyStories, https://arxiv.org/abs/2305.07759). A minimal sketch of loading it is below.

I have fewer concrete examples beyond that, but my understanding is that dataset curation is where many of the gains come from at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories itself was generated with GPT-4, for example.
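If anyone wants to poke at it, here's a minimal sketch using the Hugging Face datasets library (assuming the "text" column named on the dataset card):

    from datasets import load_dataset

    # TinyStories: short, simple-vocabulary stories intended for
    # training small language models
    ds = load_dataset("roneneldan/TinyStories", split="train")

    # Peek at the start of the first story
    print(ds[0]["text"][:200])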
gpjt · 3 days ago (in reply)
OP here: one thing that surprised me in this experiment was that the model trained on the more heavily curated FineWeb-Edu dataset performed worse than the one trained on plain FineWeb. That is very counterintuitive to me.