xdavidliu 5 hours ago |
I should have clarified what I meant. The training data includes, roughly speaking, the entire internet. Open source code is probably a large fraction of the code in that data, but it is a tiny fraction of the total data, which is mostly non-code. My point was that the hypothetical of "not contributing to any open source code", to the extent that it would have left LLMs with no code to train on, would not have had as big an impact as that person thought, since the very large majority of the internet is text, not code.