simonw 2 days ago
Some of the models DO reveal their training data, and they're still built on "stolen work" in that the data is unlicensed scrapes of the Web. Here's an example: https://huggingface.co/allenai/OLMo-2-0325-32B Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool , which includes 8 trillion tokens from Common Crawl.