Oras 8 hours ago
I got excited reading the article about releasing the training data, so I went to their HF account to look at the data (dolma3). And the first rows? Text scraped from porn websites!
andy99 6 hours ago
Isn’t this before any curation has happened? I looked at it, and I can see why it looks bad, but if they’re really being open about the whole pipeline, they have to include everything. Giving them a hard time for it only promotes keeping models closed. That said, I like to think that if it were my dataset, I would have shuffled that part down the list so it didn’t show up in the HF preview.
logicchains 8 hours ago
Erotic fiction is one of the main use cases of such models.