fortran77 3 days ago

I don't really understand this. Is it really that costly to keep the entire database if they're going to keep part of it?

tombert 3 days ago | parent | next [-]

I built a URL shortener years ago for fun. I don't have the resources that Google has, but I just hacked it together in Erlang using Riak KV, and it horizontally scaled across at least three computers (I didn't have more at the time).

Unless I'm just super smart (I'm not), it's pretty easy to write a URL shortener as a key-value system, and pure key-value stuff is pretty easy to scale. I cannot imagine Google isn't doing something at least as efficient as what I did.
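
The core of it really is just this (a toy Python sketch, with an in-memory dict standing in for a distributed KV store like Riak; the names are made up for illustration):

    # Toy sketch: a URL shortener is essentially one key-value mapping.
    # A dict stands in for a distributed KV store like Riak KV.
    import secrets

    store = {}  # short code -> long URL

    def shorten(long_url: str) -> str:
        code = secrets.token_urlsafe(6)   # random, URL-safe short code
        store[code] = long_url
        return code

    def resolve(code: str) -> str | None:
        return store.get(code)            # one key lookup; trivially shardable by code

    code = shorten("https://example.com/some/long/path")
    print(code, "->", resolve(code))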

wtallis 3 days ago | parent [-]

Google also has the advantages that they now only need a read-only key-value store and that they know the frequency distribution of lookups. This is the kind of problem many programmers would happily spend a weekend optimizing to get the average lookup time down to tens of nanoseconds.
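
One way to picture it (a hedged Python sketch; the hot/cold split and all the names are illustrative, and a real tens-of-nanoseconds version would look more like a memory-mapped perfect-hash table in a lower-level language):

    # Hedged sketch: a frozen, read-only lookup structure that exploits a known
    # frequency distribution. The hottest keys live in a small dict; the long
    # tail lives in sorted parallel lists searched with bisect.
    import bisect

    def build(entries, hot_fraction=0.01):
        # entries: iterable of (short_code, long_url, hit_count)
        ranked = sorted(entries, key=lambda e: e[2], reverse=True)
        n_hot = max(1, int(len(ranked) * hot_fraction))
        hot = {code: url for code, url, _ in ranked[:n_hot]}
        cold = sorted((code, url) for code, url, _ in ranked[n_hot:])
        cold_keys = [c for c, _ in cold]
        cold_urls = [u for _, u in cold]
        return hot, cold_keys, cold_urls

    def lookup(key, hot, cold_keys, cold_urls):
        url = hot.get(key)
        if url is not None:
            return url
        i = bisect.bisect_left(cold_keys, key)
        if i < len(cold_keys) and cold_keys[i] == key:
            return cold_urls[i]
        return None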

tombert 3 days ago | parent [-]

I don't think it would even cost me very much to host all of these links on GCP or AWS; probably not more than a couple hundred dollars a year.

Obviously raw server costs aren't the only costs associated with something like this; you'd still need to pay software people to keep it on life support. But considering how simple URL shorteners are to implement, I still don't think it would be that expensive.

ETA:

I should point out that even something kind of half-assed could be built with Cloud Functions and BigTable really easily; it wouldn't win any contests for low latency, but it would be exceedingly simple code, it would have sufficient uptime guarantees, and it would be much less likely to piss off the community.
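
Something along these lines (an untested sketch, not Google's actual setup; the project, instance, table, and column names are placeholders):

    # Untested sketch: a redirect-only shortener as an HTTP Cloud Function backed
    # by Bigtable. "my-project", "shortener", "links", "link", and "url" are
    # placeholder names, not anything Google actually uses.
    import flask
    import functions_framework
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("shortener").table("links")

    @functions_framework.http
    def redirect(request: flask.Request):
        code = request.path.lstrip("/")
        row = table.read_row(code.encode("utf-8"))
        if row is None:
            return ("Not found", 404)
        url = row.cells["link"][b"url"][0].value.decode("utf-8")
        return flask.redirect(url, code=301)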

If I had any idea how to reach out to higher-ups at Google I would offer to contract and build it myself, but that's certainly not necessary; they have thousands of developers, most of whom could write this themselves in an afternoon.

benoau 3 days ago | parent | prev [-]

I don't understand the data on ArchiveTeam's page, but it seems like they have 35 terabytes of data (286.56 TiB)? That's a lot larger than I'd have thought.

wtallis 3 days ago | parent [-]

FYI, "TiB" means terabytes with a base of 1024, ie. the units you'd typically use for measuring memory rather than the units you'd typically see drive vendors using. The factor of 8 you divided by only applies to units based on bits rather than bytes, and those units use "b" rather than "B", and are only used for capacity measurements when talking about individual memory dies (though they're normal for talking about interconnect speeds).

Either way, we're talking about a dataset that fits easily in a 1U server with at most half of its SSD slots filled.

jdiff 3 days ago | parent [-]

The binary units like GiB and TiB are technically supposed to be gibibytes and tebibytes. I thought it was a bit silly when they first popped up, but now I find them adorkably endearing and a good way to disambiguate something that's often left vague at your expense.

wtallis 3 days ago | parent [-]

In my experience, nobody actually says "Tebibytes" out loud; it's just that silly. In writing, when the precision is necessary, the abbreviation "TiB" does see some actual use.

hobs 3 days ago | parent [-]

If that's the unit, I am saying it, but yes: everyone gives me weird looks every time and just assumes I'm mispronouncing "terabytes", yet nobody corrects me.