jerf 10 hours ago

This document seems to treat "data" as a fungible commodity. Perhaps our use of the word encourages that. But it's not.

How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? It is in fact negative. You don't want to be training the LLM on that data. You've only got so much room in those neurons and we don't need it consumed with trying to predict temperature data series.

We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available.

Not that it really matters anyhow. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling the world's data with rich metadata and cross-linking, especially since it would be of negative value to AI training anyway.

williamtrask 10 hours ago

(OP here) I agree with this in spirit, but it's also hard to imagine that the world can be fully described with 200 terabytes of data. There's a lot more good stuff out there.

But to your point, a crucial question in AI right now is: how much quality data is still out there?

As for the impracticality, it's a great point. I disagree, having spent about 10 years working in the area, but that can be a post for another day. I understand and appreciate the skepticism.

lxgr 10 hours ago

> it's hard to imagine the world can be fully described with 200 terabytes of data

Why? Intelligence and compression might just be two sides of the same coin, and given that, I'd actually be very surprised if a future ASI couldn't make do with a fraction of that.

Just because current LLMs need tons of data doesn't mean that that's somehow an inherent requirement. Biological lifeforms seem to be able to train/develop general intelligence from much, much less.
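
For what it's worth, that coin has a standard formalization: an ideal coder spends -log2 p(symbol) bits per symbol, so whatever predicts the data better automatically compresses it better (this is the premise behind the Hutter Prize). A toy Python sketch, with made-up probabilities purely for illustration:

    import math

    text = "abababababababab"

    def bits_needed(text, prob):
        # An ideal coder spends -log2 p(symbol) bits per symbol (Shannon code length)
        return sum(-math.log2(prob(text[:i], c)) for i, c in enumerate(text))

    # Model 1: knows nothing, assumes 26 equiprobable letters
    def uniform(ctx, c):
        return 1 / 26

    # Model 2: has learned that a and b alternate (probabilities invented for this sketch)
    def alternating(ctx, c):
        if not ctx:
            return 1 / 26
        return 0.9 if (ctx[-1] == "a") == (c == "b") else 0.1 / 25

    print(bits_needed(text, uniform))      # ~75 bits
    print(bits_needed(text, alternating))  # ~7 bits: better prediction = better compression

Same data, a tenth of the bits, purely because one model predicts the next symbol better.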

williamtrask 9 hours ago

Well, we're debating a claim about the world: is the universe really only 200 terabytes of information?

"Biological lifeforms seem to be able to train/develop general intelligence from much, much less."

This statement is hard to defend. The brain takes in roughly 125 MB per second, and over an 80-year life that works out to 300+ petabytes.
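
A quick back-of-envelope check of that figure, taking the 125 MB/s premise at face value:

    # Assumes ~125 MB/s of total sensory bandwidth and an 80-year life (the commenter's numbers)
    BYTES_PER_SECOND = 125e6
    SECONDS_PER_YEAR = 365.25 * 24 * 3600
    LIFETIME_YEARS = 80

    total = BYTES_PER_SECOND * SECONDS_PER_YEAR * LIFETIME_YEARS
    print(f"{total / 1e15:.0f} PB")  # ~316 PB, i.e. "300+ petabytes"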

But that's not the real kicker. It's pretty unfair to say that humans learn everything they know between birth and death. A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

lxgr 9 hours ago

> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

That also seems several orders of magnitude off. Would you suspect that a human who only experiences life through H.264-compressing glasses, MP3-recompressing headphones, etc. would not develop a coherent world model?

What about a human only experiencing a high fidelity 3D rendering of the world based on an accurate physics simulation?
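
To put rough numbers on the compressed-senses version of that thought experiment (the bitrates below are ballpark consumer-codec figures, assumptions rather than measurements):

    # Lifetime intake under different assumed input bitrates (illustrative numbers only)
    SECONDS_PER_LIFETIME = 80 * 365.25 * 24 * 3600

    streams = {
        "raw sensory estimate (125 MB/s)": 125e6,
        "H.264 video (~5 Mbit/s)": 5e6 / 8,
        "MP3 audio (~128 kbit/s)": 128e3 / 8,
    }

    for name, bytes_per_second in streams.items():
        print(f"{name}: {bytes_per_second * SECONDS_PER_LIFETIME / 1e15:.3g} PB")
    # raw: ~316 PB; H.264: ~1.6 PB; MP3: ~0.04 PB

Under these assumptions, compressing the stream cuts the lifetime intake by roughly two orders of magnitude.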

The claim that humans need petabytes of data to develop their mind seems completely indefensible to me.

> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.

Isn't that like saying that you only need the right data? In which case I'd completely agree :)

williamtrask 8 hours ago

"The claim that humans need petabytes of data to develop their mind seems completely indefensible to me."

And yet every human you know is using petabytes of data to develop their mind. :)