Oras 8 hours ago

I got excited reading the article about releasing the training data, so I went to their HF account to look at the data (dolma3). And the first rows? Text scraped from porn websites!

https://huggingface.co/datasets/allenai/dolma3

andy99 6 hours ago

Isn’t this before any curation has happened? I looked at it, I can see why it looks bad, but if they’re really being open about the whole pipeline, they have to include everything. Giving them a hard time for it only promotes keeping models closed.

That said, I like to think that if it were my dataset, I would have shuffled that part down the list so it didn’t show up in the HF preview.

Oras 6 hours ago

Hard time? What value do adult video descriptions, view counts, and comments add to small (7B, 32B) models?

andy99 6 hours ago

It says it’s Common Crawl. I interpret that to mean this is a generic web-scrape dataset; presumably they filter out the stuff they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.

khimaros 4 hours ago

What if that's where they learned how to use the double entendre? Hard times indeed.

logicchains 8 hours ago

Erotic fiction is one of the main use cases of such models.