Aardwolf 3 days ago

I don't understand the page. It shows a list of data sets (I think?) up to 91 TiB in size.

The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?

digitaldragon 3 days ago | parent | next [-]

The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.

2 days ago | parent | next [-]
[deleted]
lyu07282 2 days ago | parent | prev [-]

Did they follow the redirect and archive the page content? But why?

jdiff 3 days ago | parent | prev | next [-]

I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs are going to be shorter, some longer, but ballparking it all, 91 TiB lands us in the neighborhood of hundreds of billions of URLs, up to trillions of URLs.
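For anyone who wants to check the ballpark, here is a minimal sketch of that arithmetic. It assumes the whole 91 TiB is a flat table of (6-byte ID, target URL) pairs with no compression or overhead. The 705-byte and 6-byte figures are from the comment above; the ~77-byte "short" case is an assumption borrowed from the average URL length quoted elsewhere in this thread, and `urls_that_fit` is just an illustrative helper name.

```python
TIB = 2**40  # bytes in one tebibyte


def urls_that_fit(total_bytes: int, bytes_per_entry: int) -> int:
    """How many fixed-size entries fit in total_bytes (integer division)."""
    return total_bytes // bytes_per_entry


archive_bytes = 91 * TIB

# Pessimistic case: every target URL is as long as the 705-byte sample,
# plus 6 bytes for the short-link ID.
long_entry = 705 + 6

# Optimistic case (assumption): targets average about 77 bytes, plus the ID.
short_entry = 77 + 6

low = urls_that_fit(archive_bytes, long_entry)
high = urls_that_fit(archive_bytes, short_entry)

print(f"long URLs:  ~{low / 1e9:.0f} billion entries")
print(f"short URLs: ~{high / 1e12:.1f} trillion entries")
```

Under those assumptions the count lands between the low hundreds of billions and a bit over a trillion, which is the range described above.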

rafram 3 days ago | parent [-]

> A random URL I pulled from a Google search was 705 bytes.

705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.

jdiff 2 days ago | parent [-]

It is long; it represents the lower bound (the hundreds of billions) in my awful napkin math.

ethan_smith 2 days ago | parent | prev | next [-]

The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if original destinations disappear.

account42 a day ago | parent [-]

OK, but the destination pages are not at risk (or at least not any more than any random page on the web), so why spend any effort crawling them before all the short links have been saved?

lyu07282 3 days ago | parent | prev [-]

3.75 billion URLs; according to this [1], the average URL is 76.97 characters, so that would be ~268.8 GiB without the goo.gl ID/metadata. So I also wonder what's up with that.

[1] https://web.archive.org/web/20250125064617/http://www.superm...
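The ~268.8 GiB estimate can be reproduced with a few lines. This is only a sketch: it assumes one byte per URL character (plain ASCII), ignores the short-link IDs and any metadata, and `mapping_size_gib` is a made-up helper name.

```python
GIB = 2**30  # bytes in one gibibyte
TIB = 2**40  # bytes in one tebibyte


def mapping_size_gib(url_count: float, avg_url_bytes: float) -> float:
    """Size in GiB of url_count target URLs at avg_url_bytes each."""
    return url_count * avg_url_bytes / GIB


# Figures from the comment: 3.75 billion URLs, 76.97 characters on average.
size_gib = mapping_size_gib(3.75e9, 76.97)

# How many times larger the 91 TiB archive is than the bare target URLs.
ratio = (91 * TIB / GIB) / size_gib

print(f"bare target URLs: {size_gib:.1f} GiB")
print(f"91 TiB is roughly {ratio:.0f}x that")
```

The gap of a few hundred times is consistent with the archive holding more than a short-to-long URL mapping.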