▲ Eisenstein 3 hours ago
> How many models are trained only on legal[0] data? None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of: 1. Common Crawl 2. GitHub 3. Wikipedia, Wikibooks 4. Reddit (pre-2023) 5. Semantic Scholar 6. Project Gutenberg
▲ austinjp 2 hours ago | parent [-]
Nice, I hadn't heard of this. For convenience, here are HuggingFace models trained on Dolma: