| ▲ | TheCraiggers 10 hours ago | |
I'd love to see a citation there. We already know from a few years ago that they were training AI based on projects on GitHub. Meanwhile, I highly doubt software firms were lining up to have their proprietary code bases ingested by AI for training purposes. Even with NDAs, we would have heard something about it. | ||
| ▲ | xdavidliu 5 hours ago | parent [-] | |
I should have clarified what I meant. The training data includes roughly speaking the entire internet. Open source code is probably a large fraction of the code in the data, but it is a tiny fraction of the total data, which is mostly non-code. My point was that the hypothetical of "not contributing to any open source code" to the extent that LLMs had no code to train on, would not have made as big of an impact as that person thought, since a very large majority of the internet is text, not code. | ||