stuartjohnson12 3 days ago

This sounds like a fascinating niche piece of technical expertise I would love to hear more about.

What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects?

jandrewrogers 3 days ago

It depends on the intended workload, but there are a few common design problems. Keep in mind that you can't just deal in the average case; you have to design for the worst possible cases of extremely skewed or pathologically biased distributions. A lot of the design work is proving worst-case resource bounds under various scenarios and then proving the worst-case behavior of the designs intended to mitigate them.

An obvious one is bulk deletion, which is rarely fast at any scale. This may involve trillions of updates to search indexing structures, which in naive implementations could look like pointer-chasing across disk. Releasing storage back to allocators has no locality, because the list of allocations to release is itself being streamed off that storage in semi-random order. It is unhelpfully resistant to most scheduling-based locality optimization techniques. You also want to parallelize this as much as possible, and some of these allocators will be global-ish.
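One common partial mitigation is to buffer pending frees and batch them by allocator region, so a release pass visits each region once instead of hopping around in arrival order. A minimal sketch (illustrative Python; the region-size granularity and function names are invented here, not from any particular system):

```python
from collections import defaultdict

def batch_frees_by_region(extents, region_size=1 << 30):
    """Group freed (offset, length) extents by allocator region.

    Frees arrive in semi-random order; grouping them by the region
    (here, a hypothetical 1 GiB allocator zone) means the release
    pass touches each region's allocator state once, restoring some
    locality. Returns {region_id: sorted list of extents}.
    """
    by_region = defaultdict(list)
    for offset, length in extents:
        by_region[offset // region_size].append((offset, length))
    return {region: sorted(batch) for region, batch in sorted(by_region.items())}
```

This only helps within a buffer's worth of frees, and it does nothing for the global-ish allocators mentioned above; it's a sketch of the locality idea, not a full design.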

The most interesting challenge to me is meta-scheduling. Cache replacement algorithms usually don't provide I/O locality at this scale, so standard mitigations for cache-resistant workloads, like dynamic schedule rewriting and latency-hiding, are used instead. Schedulers are in the center of the hot path, so you really want these to be memory-resident and fast. Their state size is loosely correlated with the number of objects, so in some extreme cases it can easily exceed available memory on large servers. You can address this by designing a "meta-scheduler" that adaptively optimizes the scheduling of scheduler state, so that the right bits are memory-resident at the right time for the scheduler to optimally schedule its workload. It is difficult to overstate how much of a breaking change to conventional architecture this turns out to be. These add some value even if the state is memory-resident, but they greatly increase design complexity and make tail latencies more difficult to manage.
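A toy sketch of the two-level shape of this (illustrative Python; a plain LRU stands in for the replacement policy, which, per the comment above, is exactly the kind of policy that falls short at this scale, so treat it purely as scaffolding):

```python
from collections import OrderedDict

class SchedulerStatePager:
    """Toy meta-scheduler: decides which pages of *scheduler state*
    stay memory-resident. Capacity is in pages; a dict stands in for
    the on-disk copy of evicted pages. All names are invented for
    illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page_id -> state, in LRU order
        self.evicted = {}              # stand-in for paged-out state

    def touch(self, page_id, load):
        """Fetch a page of scheduler state, paging in/out as needed."""
        if page_id in self.resident:
            self.resident.move_to_end(page_id)  # mark recently used
        else:
            state = self.evicted.pop(page_id, None)
            if state is None:
                state = load(page_id)  # cold load from storage
            self.resident[page_id] = state
            if len(self.resident) > self.capacity:
                cold_id, cold = self.resident.popitem(last=False)
                self.evicted[cold_id] = cold  # page out coldest state
        return self.resident[page_id]
```

A real meta-scheduler would replace the LRU with something that predicts which scheduler state is needed next from the workload itself, which is where the architectural breakage comes from.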

A more basic challenge is that you start dealing with numbers that may not be representable in 64 bits. Similarly, many popular probabilistic algorithms may not offer the guarantees you need when the number of entities is this large.
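For a sense of scale (plain Python arithmetic, which is arbitrary-precision; the same expressions in uint64 C would silently overflow):

```python
# A quadrillion objects is ~2^50, so the raw count still fits in
# 64 bits, but derived quantities blow past it quickly: pair counts,
# cumulative byte totals, sparse ID spaces with headroom, etc.
n = 10**15                  # one quadrillion objects
pairs = n * (n - 1) // 2    # number of possible object pairs
assert n < 2**64            # the object count itself fits in 64 bits
assert pairs > 2**64        # but pairwise quantities need 128-bit math
```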

I aggressively skirted these issues for a long time before relenting. I deal more with database storage engines than filesystems, but to a first approximation "files" and "shards" are equivalent for these purposes.

comprev 3 days ago

This is quite fascinating, thank you!

stuartjohnson12 2 days ago

This is fascinating. If you wrote a long essay about this, I (and probably most of Hacker News) would surely love to read it.

KaiserPro 3 days ago

You really notice metadata performance (try a git checkout on EFS on AWS; loads of small files take fucking ages). However, EFS is actually pretty fast: you can get decent throughput if you're writing to just one file. But if you're trying to open 1000 1 MB files to read from vs one 1 GB file, it'll be much slower (unless they've dramatically improved performance recently).
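The small-files penalty is mostly per-open metadata round trips, not bandwidth. A back-of-envelope sketch (all numbers are assumptions for illustration, not measured EFS figures):

```python
# Same total bytes, same assumed bandwidth; the only difference is
# one metadata round trip per file open.
rtt_s = 0.001              # assumed ~1 ms per open (lookup + open)
bandwidth = 500e6          # assumed 500 MB/s sustained link

small = 1000 * (rtt_s + 1e6 / bandwidth)  # 1000 opens of 1 MB each
big = rtt_s + 1e9 / bandwidth             # one open, one 1 GB stream
# The per-open latency alone adds a full second to the small-file
# case, and real network filesystems often pay several round trips
# per file, which makes the gap far worse than this sketch.
```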

Trying to have a fast, globally consistent database for a quadrillion items in the _same_ namespace is super hard. You need to choose a tradeoff between speed, partition tolerance, and consistency.

You're much better off sharding into discrete logical units. It's very rare that you need a global namespace for a filesystem. For VFX, where we used Lustre a lot, the large namespace was a nice-to-have; it was more about getting a RAID-0 across file servers (well, object stores) for performance.

For filesystems specifically, if you're using folders, then you don't actually need to guarantee much outside of a folder. As long as filenames are unique within their folder, you can get away with a lot of shit you can't do in a normal database. You also don't need directories to be on the same filesystem (well, in Linux at least), so you can shard by using directories as a key.

The directory-key filesystem approach is actually hilariously simple, fast, scalable, and reliable. If a single server/FS goes down, it only takes out that area. On the downside, it does mean you can get hot spots and overwhelm individual shards.
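The directory-as-key routing can be sketched in a few lines (illustrative Python; the hash choice and shard count are invented, and a real deployment would use a lookup table so shards can be rebalanced):

```python
import hashlib

def shard_for_path(path, num_shards=64):
    """Route a file to a backend by hashing its parent directory, so
    every entry in one folder lands on the same server and filename
    uniqueness only has to be enforced within that shard."""
    parent = path.rstrip("/").rpartition("/")[0]
    digest = hashlib.sha256(parent.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

This is also where the hot-spot downside falls out: one very busy directory maps to exactly one shard, with nothing to spread the load.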

dekhn 3 days ago

We are truly spoiled by all the improvements that went into local filesystems that are lacking in network filesystems. So much of our perception of "computer is fast" is really just write-caching, read-caching, and read-ahead.

KaiserPro 2 days ago

Oh, NVMe and commodity 10/40/100 Gb networks mean that NFS shares can be _faster_ than local disk.

In 2008, when I was a young'un, a 100 TB filesystem that could sustain 1-3 gigabytes/sec of streaming throughput took something like 40 racks. Huge amounts of cost and power were needed to set it up and maintain it, and any kind of random I/O would kneecap performance for everyone.

Now you can have a 2U server with 100 TB of NVMe storage where the only bottleneck is the network adapter! Not only that, but it's pretty cheap too.