qingcharles 4 hours ago
How many models are trained only on legal[0] data? Adobe's Firefly is the one commercial model I can think of.

[0] I think the data can be licensed, not just public domain, e.g. if the creators are suitably compensated for their data being ingested.
Eisenstein 3 hours ago
> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but Olmo is trained on the Dolma 3 dataset, which is:

1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg