Surely there's AI usage that's not morally reprehensible.

For example, models trained only on public domain material, used for work that adds real value rather than for marketing or gamification gimmicks...

How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.

[0] I think licensed data counts too, not just public domain; e.g. if the creators are suitably compensated for their data being ingested.

> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but Olmo is trained on the Dolma 3 dataset, which draws on:

1. Common Crawl

2. GitHub

3. Wikipedia, Wikibooks

4. Reddit (pre-2023)

5. Semantic Scholar

6. Project Gutenberg

Nice, I hadn't heard of this. For convenience, here are HuggingFace models trained on Dolma:
