rogerrogerr 2 days ago

They’ll never reveal the data, because that would reveal this is all built on stolen work.

simonw 2 days ago

Some of the models DO reveal their data, and they're still built on "stolen work" in that the data is unlicensed scrapes of the Web. Here's an example:

https://huggingface.co/allenai/OLMo-2-0325-32B

Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool - which includes 8 trillion tokens from Common Crawl.