HighFreqAsuka 10 hours ago
Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They build a reasonable 7B-parameter model using only openly licensed data.
nickpsecurity 8 hours ago
They mostly do that. They risked legal contamination by using Whisper-derived text and web text, which might have gotchas. Other than that, it was a great collection for low-risk training.