orbital-decay 21 hours ago
The reality of working with humongous datasets is that they're always bootstrapped like this, in multiple steps. In LLMs in particular, the entire post-training step is always done on synthetic data. There are ways to avoid the failure modes typical of that (like model collapse); you need much less real data to keep the model in check than you probably think.