mk_stjames 3 days ago

In text form only (no charts, plots, etc)- yes, pretty much all published 'science' (by that I mean something that appeared in a mass publication - paper, book, etc, not simply notes in people's notebooks) in the last 400 years likely fits into 20TB or so if converted completely to ASCII text and everything else is left out. Text is tiny.

The problem is that it's not all text: you need the images, the plots, etc., and smartly, interstitially compressing the old stuff is still a very difficult problem even in this age of AI.

I have an archive of about 8TB of mechanical and aerospace papers dating back to the 1930s, and the biggest of them are usually scanned-in documents, especially stuff from the 1960s and 70s, that have lots of charts and tables that take up a considerable amount of space, even in black and white only, due to how badly old scans compress (noise on paper prints, once scanned in, just doesn't compress). Also, many of those journals have the text compressed well, but they have a single, color, HUGE cover image as the first page of the PDF, which turns the PDF from 2MB into 20MB. Things like that could, maybe, be omitted to save space...
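To illustrate the point about noise, here is a rough sketch, not a model of any real scanner or PDF codec. It builds a synthetic grayscale "page", flips a small fraction of pixels to simulate speckle, and compares zlib-compressed sizes. The page size, the 2% noise fraction, and all function names are assumptions made up for the example.

```python
import random
import zlib

WIDTH, HEIGHT = 512, 512  # hypothetical page size, one byte per pixel


def make_clean_page() -> bytearray:
    """White page (255) with a black horizontal rule (0) every 16 rows."""
    page = bytearray([255]) * (WIDTH * HEIGHT)
    for row in range(0, HEIGHT, 16):
        start = row * WIDTH
        page[start:start + WIDTH] = bytes(WIDTH)  # WIDTH zero bytes
    return page


def add_speckle(page: bytearray, fraction: float, seed: int = 0) -> bytearray:
    """Invert a random fraction of pixels to mimic paper/scanner noise."""
    rng = random.Random(seed)
    noisy = bytearray(page)
    for i in range(len(noisy)):
        if rng.random() < fraction:
            noisy[i] = 255 - noisy[i]
    return noisy


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(bytes(data), 9))


clean = make_clean_page()
noisy = add_speckle(clean, fraction=0.02)  # assumed noise level

print("raw size:  ", len(clean), "bytes")
print("clean page:", compressed_size(clean), "bytes compressed")
print("noisy page:", compressed_size(noisy), "bytes compressed")
```

The clean page is almost pure repetition, so it shrinks to very little; the speckled copy carries the same content but compresses noticeably worse, because every randomly flipped pixel is information the compressor has to store.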

But as time goes on I am becoming more opposed to saving space by truncating those kinds of scanned documents. My reasoning is that storage is getting cheaper and cheaper, and at some point the cost to store and retrieve those 80-90MB PDFs that are essentially total page-by-page image scans is going to be completely negligible. And I think you lose something by taking those papers and taking the covers out, or OCR'ing the typed pages and re-typesetting them to Unicode (de-rasterizing the scan), even when done perfectly (and when not done perfectly, you get horrible mistakes, especially in things like equations). I think we need to preserve everything to a quality level that is nearly as high as can be.

bawolff 3 days ago | parent [-]

> In text form only (no charts, plots, etc)- yes, pretty much all published 'science' (by that I mean something that appeared in a mass publication - paper, book, etc, not simply notes in people's notebooks) in the last 400 years likely fits into 20TB or so if converted completely to ASCII text and everything else is left out. Text is tiny.

20TB of uncompressed text is roughly 6TB compressed.
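The arithmetic behind that estimate, as a minimal sketch: 6TB out of 20TB implies a compression ratio of about 0.3. The helper names below are made up for illustration, and zlib is just a stand-in compressor; the comment does not say which one the figure assumes.

```python
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, level)) / len(data)


def scaled_size_tb(total_tb: float, ratio: float) -> float:
    """Estimated compressed size in TB for a corpus of total_tb at a given ratio."""
    return total_tb * ratio


# Ratio implied by the figures above: 6TB compressed out of 20TB raw.
implied_ratio = 6 / 20
print(f"implied ratio: {implied_ratio:.2f}")
print(f"20TB at that ratio: {scaled_size_tb(20, implied_ratio):.1f}TB")
```

To check the ratio against real data, pass a large sample of plain text (as bytes) to `compression_ratio` and feed the result to `scaled_size_tb`. The measured ratio will vary with the compressor and the text.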

I just find it crazy that for about $100 I can buy an external hard drive that fits in my pocket and can, in theory, carry around the bulk of humanity's collected knowledge.

What a time to be alive. Imagine telling someone this 100 years ago. Hell, imagine telling someone this 20 years ago.