atrettel 3 days ago
I'm not convinced that LLMs and other AI models need to train on all available material; a representative sample is better. (I'll set aside the legality aspects for most of this response.) I think coming up with a representative sample of all relevant information would be better in the long term: teams that did so would not be outcompeted over long time horizons.

Why don't the companies do this? Because it is easier to just "carpet bomb the parameter space" and worry about the potential confounding [1] and sampling bias [2] later. Coming up with a representative sample requires domain expertise, and that is expensive in both time and money. But it reduces the total amount of training data, which should in turn reduce the time and resources it takes to build the models. That may matter now that models are quite large.

This is definitely a design decision with tradeoffs on both sides. I can entertain the notion that we don't have time to sample things carefully, but I think we all too often dismiss the long-term benefits of proper sampling.

(On the legality aspects: judges are, in my opinion, trying to "split the baby" [3] by saying that training on material you obtained legally is OK but training on pirated material isn't. So nobody is going to recommend training on pirated material in the first place.)

[1] https://en.wikipedia.org/wiki/Confounding

[2] https://en.wikipedia.org/wiki/Sampling_bias

[3] https://www.404media.co/judge-rules-training-ai-on-authors-b...
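For what it's worth, one common way to build a "representative sample" is stratified sampling: group the corpus by some attribute (domain, topic, source) and sample the same fraction from each stratum, so that rare strata aren't drowned out by sheer volume. A minimal sketch in Python (the field names and toy corpus here are hypothetical, just to illustrate the idea):

```python
import random
from collections import defaultdict

def stratified_sample(docs, key, fraction, seed=0):
    """Sample `fraction` of documents from each stratum (grouped by
    `key`) rather than from the pool at large, so every stratum is
    represented in proportion to its size, with a floor of one doc."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Toy corpus: generic web text dwarfs everything else, as in real crawls.
corpus = ([{"domain": "web", "id": i} for i in range(900)]
          + [{"domain": "code", "id": i} for i in range(80)]
          + [{"domain": "math", "id": i} for i in range(20)])

subset = stratified_sample(corpus, key="domain", fraction=0.1)
```

This keeps the domain proportions of the full corpus (90 web, 8 code, 2 math docs) at a tenth of the size. The expensive part the comment points to is upstream of this code: deciding what the strata should be, which takes domain expertise.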
CamperBob2 2 days ago | parent
Perhaps, but it seems safe to assume that the most valuable training material will be the 'illegal' material that is copyright-encumbered.