Aardwolf | 3 days ago
I don't understand the page. It shows a list of data sets (I think?) up to 91 TiB in size. The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?
digitaldragon | 3 days ago
The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
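A rough sketch of what pulling a record back out of one of those WARCs could look like, assuming the Python warcio library (the filename here is just a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # iterate over the archived HTTP exchanges in one (hypothetical) WARC file
    with open('googl-example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')  # the goo.gl URL that was fetched
                status = record.http_headers.get_statuscode()           # e.g. a 301/302 redirect status
                body = record.content_stream().read()                   # full archived response body
                print(url, status, len(body))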
jdiff | 3 days ago
I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs are going to be shorter, some longer, but just ballparking it all, that lands us in the neighborhood of hundreds of billions of URLs, up to trillions.
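Rough arithmetic behind that ballpark, using the same numbers as above:

    # how many (6-byte id -> ~705-byte URL) mappings would it take to fill 91 TiB?
    total_bytes = 91 * 2**40          # 91 TiB in bytes
    per_mapping = 705 + 6             # sampled long URL + 6-byte short id
    print(total_bytes / per_mapping)  # ~1.4e11, i.e. hundreds of billions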
ethan_smith | 2 days ago
The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if the original destinations disappear.
lyu07282 | 3 days ago
3.75 billion URLs; according to this[1], the average URL is 76.97 characters, which would be ~268.8 GiB without the goo.gl IDs/metadata. So I also wonder what's up with that.

[1] https://web.archive.org/web/20250125064617/http://www.superm...
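Quick check of that estimate, same figures as above:

    # 3.75 billion URLs at an average of 76.97 bytes each
    print(3.75e9 * 76.97 / 2**30)  # ~268.8 GiB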