rogerrogerr 2 days ago
They’ll never reveal the data, because that would reveal this is all built on stolen work.
simonw 2 days ago
Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web.

Here's an example: https://huggingface.co/allenai/OLMo-2-0325-32B

Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.