vlade11115 3 days ago

Also, they provide a torrents list that anyone can seed to be part of the long-term preservation effort.

https://annas-archive.org/torrents

aniviacat 3 days ago | parent | next [-]

I'm surprised i2p torrents are still not popular enough to be offered as an option by sites like this.

I'd assume there are many people who don't help out purely because of legal fears, something i2p could help with.

gylterud 3 days ago | parent | next [-]

What is the status of I2P these days? I used to run a lot of stuff on it. It was a lot of fun. It was like this cozy alternative development of the internet, where things still felt like 1997.

Sarky 3 days ago | parent [-]

It works and is getting better. Still no mainstream adoption, though.

gylterud 2 days ago | parent [-]

Cool. I might go back to it some time! I really liked the routing protocol and the ease of setting up services for it.

6jQhWNYh 3 days ago | parent | prev [-]

I2P's major drawback when torrenting is speed. Assuming a speed of 500 kbps, it would take 2,000 days to download a 10 TB torrent.
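For anyone checking that figure, a quick back-of-envelope sketch (assuming "500 kbps" means 500 kilobits per second, as the ~2,000 day figure implies):

    # Rough download-time estimate at the quoted I2P speed (assumption: 500 kilobits/s)
    size_bits = 10e12 * 8         # 10 TB expressed in bits
    rate_bps = 500e3              # 500 kbps in bits per second
    days = size_bits / rate_bps / 86400
    print(round(days))            # ~1852 days, i.e. roughly 2,000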

vidyesh 3 days ago | parent | prev | next [-]

The numbers are interesting and a bit surprising to me.

I remember a time when people would run seedboxes for private trackers and data hoarders would brag about having TBs of storage, yet only a handful of people are seeding the complete collection(s). I understand not everyone has, or can seed, multiple TBs of data, but I was expecting a lot more seeders for the torrents that are only a few hundred GBs each.

mk_stjames 3 days ago | parent | prev [-]

Interesting to see that sci-hub is about 90TB and libgen-non-fiction is 77.5TB. To me, these are the two archives that really need protecting because this is the bulk of scientific knowledge - papers and textbooks.

I keep about 16TB of personal storage space in a home server (spread over 4 spinning disks). The idea of expanding to ~200 TB, however, seems... intimidating. You're looking at roughly twelve 16TB disks (not counting any for redundancy). Going the refurbished enterprise SATA drive route, that's still going to run you about $180 per drive, or roughly $2,200 in drives.

I'm not quite there in terms of disposable income to throw at it, but I know many people out there who are. Double that cost for redundancy, throw in a bit for the server hardware, and you're at about $5k to keep a current cache of all our written scientific knowledge - that seems reasonable.
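A rough sketch of that estimate, using only the figures quoted in this thread (the drive price is the refurb ballpark above, not a market survey):

    import math

    # Sketch of the storage/cost estimate above (all figures from this thread)
    archive_tb = 90 + 77.5                   # sci-hub + libgen-non-fiction = 167.5 TB
    min_drives = math.ceil(archive_tb / 16)  # 11 drives minimum; 12 leaves some headroom
    drive_cost = 12 * 180                    # ~$2,160 in refurbished 16 TB disks
    mirrored = 2 * drive_cost                # ~$4,320 doubled for redundancy
    print(min_drives, drive_cost, mirrored)  # add server hardware and you're near $5k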

The interesting thing is these storage sizes aren't really growing. Scihub stopped updating the papers in 2022? And honestly, with the flood of slop publications since then, what's in that ~170TB is likely to remain the most important portion of the collection for a long time.

jasonfarnon 3 days ago | parent | next [-]

"Scihub stopped updating the papers in 2022"

True, but it matters a lot less in many fields because things have been moving to arXiv and other open-access options anyway. The main time I need sci-hub is for older articles, and that's a huge advantage of sci-hub - they have things like old foreign journal articles that even the best academic libraries don't have.

As for mirroring it all, $2200 is beyond my budget too, but it would be nothing for a lot of academic departments, if the line item could be "characterized" the right way. It has been a bit of a nuisance with libgen down these last couple of months, like the post mentioned, and I would have loved a local copy. I don't see it happening, but if libgen/sci-hub/annas archive goes the way of napster/scour, many academics would be in a serious fix.

account42 3 days ago | parent | prev | next [-]

It's 167.5 TB, not ~200 TB, and you can get disks much larger than 16 TB these days - a quick check shows 30 TB drives being sold in normal consumer stores, although ~20 TB disks may still be more affordable per byte.

bawolff 3 days ago | parent | prev [-]

A lot of these are (relatively large) PDFs, right?

I wonder how much space it would all take as highly compressed, deduplicated, plain text files.

Does the sum of human scientific knowledge fit on a large hard drive?

mk_stjames 3 days ago | parent | next [-]

In text form only (no charts, plots, etc.) - yes, pretty much all published 'science' (by that I mean something that appeared in a mass publication - a paper, book, etc. - not simply notes in people's notebooks) in the last 400 years likely fits into 20 TB or so if converted completely to ASCII text and everything else is left out. Text is tiny.
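As a sanity check on that 20 TB figure, here is a back-of-envelope sketch - every count and size below is a purely illustrative assumption, not a measured value:

    # Purely illustrative back-of-envelope for "all published text fits in ~20 TB"
    papers = 2e8              # assume on the order of 10^8 scholarly papers ever published
    kb_per_paper = 50         # assume a few tens of KB of plain text per paper
    books = 3e7               # assume tens of millions of distinct published books
    kb_per_book = 600         # assume a few hundred KB of plain text per book
    total_tb = (papers * kb_per_paper + books * kb_per_book) * 1e3 / 1e12
    print(round(total_tb))    # ~28 TB uncompressed - same order of magnitude as 20 TB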

The problem is it's not all text - you need the images, the plots, etc. - and smartly compressing all that old material is still a very difficult problem, even in this age of AI.

I have an archive of about 8TB of mechanical and aerospace papers dating back to the 1930s, and the biggest of them are usually scanned documents, especially stuff from the 1960s and 70s with lots of charts and tables, which take up a considerable amount of space even in black and white because of how badly old scans compress (noise on paper prints, scanned in, just doesn't compress). Also, many of those journals have the text compressed well but include a single, HUGE, color cover image as the first page of the PDF, which turns the PDF from 2MB into 20MB. Things like that could, maybe, be omitted to save space...

But as time goes on I become more and more against space-saving via truncation of that kind of scanned document. My reasoning is that storage is getting cheaper and cheaper, and at some point the cost to store and retrieve those 80-90MB PDFs that are essentially page-by-page image scans is going to be completely negligible. And I think you lose something by taking those papers and stripping out the covers, or OCR'ing the typed pages and re-typesetting them in Unicode (de-rasterizing the scan), even when done perfectly (and when not done perfectly, you get horrible mistakes, especially in things like equations). I think we need to preserve everything to a quality level that is nearly as high as it can be.

bawolff 3 days ago | parent [-]

> In text form only (no charts, plots, etc.) - yes, pretty much all published 'science' (by that I mean something that appeared in a mass publication - a paper, book, etc. - not simply notes in people's notebooks) in the last 400 years likely fits into 20 TB or so if converted completely to ASCII text and everything else is left out. Text is tiny.

20 TB of uncompressed text is roughly 6 TB compressed.
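That lines up with typical ratios for general-purpose compressors on plain prose (the 3:1 figure below is an assumption, not a benchmark):

    # Quick ratio check: zstd/xz commonly achieve roughly 3:1 or better on English text
    uncompressed_tb = 20
    assumed_ratio = 3
    print(uncompressed_tb / assumed_ratio)   # ~6.7 TB, consistent with the ~6 TB above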

I just find it crazy that for about $100 I can buy an external hard drive that fits in my pocket and can, in theory, carry around the bulk of humanity's collected knowledge.

What a time to be alive. Imagine telling someone this 100 years ago. Hell, imagine telling someone this 20 years ago.

polytely 2 days ago | parent | prev [-]

There is a post on Anna's Archive blog about exactly that: we basically have to hold on until (open source) OCR solutions are good enough, and then it suddenly becomes feasible to have all the world's published knowledge on your computer.