progval 2 hours ago

> There are research models out there which are trained on only permissively licensed data

Models whose authors tried to train only on permissively licensed data.

For example, https://huggingface.co/bigcode/starcoder2-15b was meant to be trained only on permissively licensed code, but its dataset was filtered on repository-level license only, not file-level. So when you searched for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2, back when it was working, you would find that it was trained on many files with a GPL header.
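
To make the repository-level vs file-level distinction concrete, here is a rough sketch of what a file-level check could look like. This is not what BigCode ran; the function names, the 30-line header window, and the single regex are all my own assumptions, and a real pipeline would want a proper license scanner rather than one pattern.

```python
import re

# Hypothetical marker list (an assumption for illustration, not BigCode's filter).
COPYLEFT_MARKERS = [
    re.compile(
        r"under the terms of the GNU (Lesser |Affero )?General Public License",
        re.IGNORECASE,
    ),
]


def has_copyleft_header(source: str, header_lines: int = 30) -> bool:
    """Return True if the first few lines of a file look like a GPL-family header."""
    header = " ".join(source.splitlines()[:header_lines])
    # Collapse whitespace and common comment characters so that a notice
    # wrapped across several comment lines still matches as one phrase.
    header = re.sub(r"[\s*#/]+", " ", header)
    return any(pattern.search(header) for pattern in COPYLEFT_MARKERS)


def keep_permissive_files(files: dict[str, str]) -> list[str]:
    """Given {path: contents} from a repo already accepted at repository level,
    return the paths that survive the file-level check."""
    return [path for path, text in files.items() if not has_copyleft_header(text)]
```

The point is only that a repo tagged MIT at the top level can still contain vendored files that this kind of per-file pass would catch and a repository-level filter would not.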