Unlocking a Million Times More Data for AI (ifp.org)
27 points by williamtrask 10 hours ago | 57 comments

hunterpayne | 8 hours ago
"Homomorphic encryption enables the aggregation of these distributed model pieces while they are encrypted, allowing for federated learning without centralizing data." A bigger hand wave has never been done I think. Homomorphic encryption increases computational load several fold. And I'm not aware of anyone trying to use this (very interesting) technology for much of anything, let alone GPU ML algorithms. | ||||||||||||||||||||||||||||||||||||||||||||

janice1999 | 9 hours ago
A think tank wants to help companies get access to private medical and other personal data. The "solution" to the privacy "problem" sounds like a blockchain pitch circa 2019. Wonderful.

destroycom | 7 hours ago
This doesn't read like an article written with much research or sincerity. The claim is that there is a million times more data to feed to LLMs, citing a few articles. Those articles estimate that there are 180-200 zettabytes (the number mentioned in TFA) of data in the world in total, including all cloud services, all personal computers, and so on. The vast majority of that data is useless for training LLMs: movies, games, databases. There is massive duplication within it. Only a tiny fraction will be useful at all.

> Think of today’s AI like a giant blender where, once you put your data in, it gets mixed with everyone else’s, and you lose all control over it. This is why hospitals, banks, and research institutions often refuse to share their valuable data with AI companies, even when that data could advance critical AI capabilities.

This is not the reason. The reason is that the data is private. LLMs do not just learn from data; they can often reproduce it verbatim. You cannot hand over real people's medical or bank records; doing so puts them at very real risk. And much of this data, however well structured, is useless for LLM training anyway. You will not improve the perceived "intellect" of a model by overfitting it on terabytes of bank transaction tables.

k310 | 9 hours ago
What that locked (private) data entails:

> What makes this vast private data uniquely valuable is its quality and real-world grounding. This data includes electronic health records, financial transactions, industrial sensor readings, proprietary research data, customer/population databases, supply chain information, and other structured, verified datasets that organizations use for operational decisions and to gain competitive advantages. Unlike web-scraped data, these datasets are continuously validated for accuracy because organizations depend on them, creating natural quality controls that make even a small fraction of this massive pool extraordinarily valuable for specialized AI applications.

Will there be a data exchange where one can buy and sell data, or even "commododata" markets where one can hedge or speculate on futures? Asking for a friend.

Normal_gaussian | 9 hours ago
Big data's "no true Scotsman" problem:

> Despite what their name might suggest, so-called “large language models” (LLMs) are trained on relatively small datasets. For starters, all the aforementioned measurements are described in terms of terabytes (TBs), which is not typically a unit of measurement one uses when referring to “big data.” Big data is measured in petabytes (1,000 times larger than a terabyte), exabytes (1,000,000 times larger), and sometimes zettabytes (1,000,000,000 times larger).
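To put the quote's unit gap in concrete terms, a back-of-envelope sketch (the corpus size below is an assumed round number, not a figure from TFA):

    # Decimal units. 180 ZB is the estimate cited in TFA; the corpus
    # size is an assumed round number for a large pretraining set.
    TB = 10**12
    ZB = 10**21

    pretraining_corpus = 100 * TB
    world_data = 180 * ZB

    print(world_data / pretraining_corpus)  # 1.8e6, the "million times" headline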

svieira | 8 hours ago
> What makes this vast private data uniquely valuable is its quality and real-world grounding.

This is a bold assumption. After Enron (financial transactions), Lehman Brothers (customer/population databases, financial transactions), Theranos (electronic health records), Nikola (proprietary research data), Juicero (I don't even know what this is), WeWork (umm ... everything), and FTX (everything, and we know they didn't mind lying to themselves), I'm pretty sure we can all say for certain that "real-world grounding" isn't a guarantee of anything where money or ego is involved.

Not to mention that at this point we're actively dealing with processes being run (improperly) by AI (see the lawsuits against Cigna and UnitedHealthcare [1]), leading to self-training loops without the "self" aspect being revealed.

[1]: https://www.afslaw.com/perspectives/health-care-counsel-blog...

themafia | 8 hours ago
It's simple: pay them. Otherwise, why on Earth should I care about "contributing to AI"? It's just another commercial venture trying to get something of high value for no money. A protocol that doesn't involve royalty payments is a non-starter.

runako | 9 hours ago
One would have to be a special kind of fool to expect honest payments from the very same organizations that are currently doing everything possible to avoid paying for the original training data they stole. | ||||||||||||||||||||||||||||||||||||||||||||

Animats | 7 hours ago
Do vast amounts of lower- and lower-quality data help much? If you can train on the entire feeds of social media, you keep up on recent pop-culture trends, but does it really make LLMs much smarter?

Recent progress on useful LLMs seems to involve slimming them down. [1] Does your customer-support LLM really need a petabyte of training data? Sure, then it can discuss everything from Kant to the latest Taylor Swift concert lineup, but it probably just needs enough of that to make small talk, plus comprehensive data on your own products. The future of business LLMs probably fits in a 1U server.

[1] https://mljourney.com/top-10-smallest-llm-to-run-locally/

ttfvjktesd | 8 hours ago
I think one important point is missing here: more data does not automatically lead to better LLMs. If you increase the amount of data tenfold, you might only see a slight improvement. We already see that simply adding more and more parameters, for instance, no longer makes models better by itself. Instead, progress is coming from techniques like reasoning, grounding, post-training, and reinforcement learning, which are the main focus of improvement for state-of-the-art models in 2025.
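For a sense of how weak the data lever is on its own, a rough sketch using the loss fit from the Chinchilla paper (Hoffmann et al. 2022), taking its published coefficients at face value:

    # L(N, D) = E + A/N^a + B/D^b, coefficients from Hoffmann et al. 2022.
    E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(params, tokens):
        return E + A / params**a + B / tokens**b

    N = 70e9  # 70B parameters, held fixed
    for D in (1.4e12, 1.4e13, 1.4e14):  # 1x, 10x, 100x the data
        print(f"{D:.1e} tokens -> loss {loss(N, D):.3f}")
    # 1.4e+12 -> 1.937, 1.4e+13 -> 1.859, 1.4e+14 -> 1.819: each 10x of
    # data buys less, and nothing gets past the irreducible term E = 1.69.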

joegibbs | 4 hours ago
I remember that a couple of years ago people were talking about how multimodal models would have skill bleed-over, so one trained on the same amount of text plus a ton of video/image data would perform better on text responses. Did this end up holding? Intuitively, I would think that text packs much more meaning into the same amount of data than visuals do (a single 1000x1000px image is about the same amount of data as a million characters), which would hamstring it.
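A sketch of that back-of-envelope, with ballpark assumptions of my own (16px ViT patches, ~4 characters per text token):

    # All numbers are rough, order-of-magnitude assumptions.
    image_bytes = 1000 * 1000 * 3      # uncompressed RGB, ~3 MB
    text_bytes = 1_000_000             # a million mostly-ASCII chars, ~1 MB

    # What a model actually sees after tokenization.
    image_tokens = (1000 // 16) ** 2   # ViT-style 16x16 patches: 3,844 tokens
    text_tokens = 1_000_000 // 4       # ~4 chars per token: 250,000 tokens

    print(image_bytes / text_bytes)    # ~3x the bytes for the image...
    print(text_tokens / image_tokens)  # ...yet ~65x more tokens from the text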

jerf | 9 hours ago
This document seems to treat "data" as a fungible commodity. Perhaps our use of the word encourages that. But it's not. How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? The value is in fact negative: you don't want to be training the LLM on that data. You've only got so much room in those neurons, and we don't need it consumed by trying to predict temperature time series. We don't need "more data"; we need "more data of the specific types we're training on." That is not so readily available.

Although it doesn't really matter anyhow. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling its data with rich metadata and cross-links. Especially since most of it would be of negative value to AI training anyhow.

lordofgibbons | 7 hours ago
We don't have a data-scarcity problem. Refinement of the pretraining stage will continue, but I don't expect orders of magnitude of additional scaling to be required any longer. What's lacking is RL datasets and environments. If any more scaling does happen, it will be in the mid-training (using agentic/reasoning outputs from previous model versions) and RL training stages.

horhay | 9 hours ago
Man, it's not like the wave of generative AI has shown us that these companies don't work with altruistic intentions and means.

supermatt | 7 hours ago
This entire article reads like hand-wavy nonsense, throwing pretty much every cutting-edge AI buzzword at a problem that doesn't exist.

All the top models are moving towards synthetic data, not because their makers want more data but because they want quality data that is structured to train useful behavior. Having zettabytes of "invisible" data is effectively pointless. You can't train on all of it because there is so much of it; it's far more expensive to train on per byte because of the homomorphic magic (if that's even possible); and, most importantly, it's not quality training data!

JackYoustra | 8 hours ago
I'm a bit worried: there could be idiosyncratic links that these models learn that cause deanonymization. Ideally you could just add a forget loss to prevent this... but how do you add a forget loss if you don't have all of the precise data necessary for such a term?
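For reference, a common formulation of a forget loss looks like the sketch below (a generic gradient-ascent-on-the-forget-set illustration, with made-up names); note it presumes exactly what's in question here, namely that you can enumerate the records to forget:

    # One common "forget loss": descend on data to keep, ascend on data
    # to forget. Generic PyTorch sketch; all names are illustrative.
    import torch.nn.functional as F

    def unlearning_loss(model, retain_batch, forget_batch, lam=0.5):
        retain_x, retain_y = retain_batch
        forget_x, forget_y = forget_batch
        keep = F.cross_entropy(model(retain_x), retain_y)
        forget = F.cross_entropy(model(forget_x), forget_y)
        # Minimizing this objective pushes the forget-set loss up.
        return keep - lam * forget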

catigula | 8 hours ago
If only laws and having to respect people and privacy didn't exist, then we could build our machine God and I could maybe (but probably not) live forever! | ||||||||||||||||||||||||||||||||||||||||||||

01HNNWZ0MV43FF | 8 hours ago
Sounds good. I am going to make a blank model, train it homomorphically to predict someone's name from their butt cancer status, then prompt it to generate a list of the names of people who have butt cancer, and blackmail them by threatening to send it to their employers.

squigz | 8 hours ago
I hope the author considers the morality of advocating that health records and financial transactions - and probably every other bit of private data we might have - be made openly available to companies.

I have a better idea: let's just cut out the middlemen and send every bit of data every computer generates to OpenAI. Sorry, to be fair, they want this to be a government-led operation... I'm sure that'll be fine too.

aaroninsf | 8 hours ago
I literally laughed out loud when I got to the modest proposal. | ||||||||||||||||||||||||||||||||||||||||||||