jerf 10 hours ago
This document seems to treat "data" as a fungible commodity. Perhaps our use of the word encourages that. But it's not fungible. How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? Its value is in fact negative. You don't want to be training the LLM on that data. You've only got so much room in those neurons, and we don't need it consumed with trying to predict temperature data series. We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available. Although it doesn't really matter anyhow. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling the world's data with rich metadata and cross-linking. That is especially true since it would be of negative value to AI training anyhow.
williamtrask 10 hours ago
(OP here) I agree with this in spirit, but it's also hard to imagine that the world can be fully described with 200 terabytes of data. There's a lot more good stuff out there. But to your point, a crucial question in AI right now is: how much quality data is still out there? As for the impracticality, it's a great point. I disagree and have spent about 10 years working in the area. But that can be a post for another day. I understand and appreciate the skepticism.