jandrewrogers 3 days ago
I have worked on exabyte-scale storage engines. There is a good engineering reason for this type of limitation. With a 1 KiB average file size you have quadrillions of metadata objects to search and manage quickly and at fine granularity. The kinds of operations and coordination you need to do on metadata are difficult to achieve reliably when the metadata structure itself is many PB in size. Interesting edge cases show up when you have to do deep paging of this metadata off of storage, and making that not slow requires unorthodox design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts that conventional architectures assume will always fit in memory.

A mere trillion objects is right around the limit where the allocators, metadata structures, etc. can be made to scale with heroic effort before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so staying away from that design frontier makes a lot of sense if you can. It is possible to break through that barrier, but it introduces myriad interesting design and computer science problems for which there is little literature.
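To put the scale in perspective, here is a rough back-of-envelope calculation. The 1 KiB average file size is from the comment above; the 1 EiB total and the ~256 bytes of metadata per object are illustrative assumptions, not figures from the comment:

```python
# Back-of-envelope: why "quadrillions of metadata objects" is hard.
# Assumptions (illustrative): 1 EiB of data, 1 KiB average object size,
# ~256 bytes of metadata per object.

data_bytes = 2**60           # 1 EiB of stored data (assumed)
avg_object_size = 2**10      # 1 KiB average file size
metadata_per_object = 256    # assumed bytes of metadata per object

objects = data_bytes // avg_object_size
metadata_bytes = objects * metadata_per_object

print(f"objects:       {objects:.3e}")                    # ~1.1e+15, i.e. quadrillions
print(f"metadata size: {metadata_bytes / 2**50:.0f} PiB")  # ~256 PiB, far beyond RAM
```

Under these assumptions the metadata alone is hundreds of PiB, which is why almost none of it can be assumed to fit in memory.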
toast0 3 days ago
Small files suck on normal filesystems too. There are reasons to have them, but if the stars align and you can go from M directories of N directories of O files to M directories of N files with O sub-files each, a lot of operations get way faster. Updates to individual sub-files probably don't, though if you're updating all the sub-files anyway and can rewrite each M/N.db in one pass, that is probably faster too.
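A minimal sketch of what that packing can look like, assuming SQLite as the per-directory container (the M/N.db naming is from the comment above; the schema and function names are illustrative):

```python
# Pack O small sub-files into one SQLite file per directory, so the
# filesystem tracks M*N files instead of M*N*O.
import sqlite3

def create_packed_db(path):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, data BLOB)")
    return db

def write_all(db, items):
    # Bulk-writing all sub-files in one transaction is the fast path the
    # comment alludes to; updating one sub-file at a time pays per-transaction
    # overhead and is likely slower than a plain small file.
    with db:
        db.executemany(
            "INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)", items)

def read_one(db, name):
    row = db.execute("SELECT data FROM blobs WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None

# Usage (hypothetical paths and names):
# db = create_packed_db("M/N.db")
# write_all(db, [("file-001", b"payload"), ("file-002", b"payload")])
# data = read_one(db, "file-001")
```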
stuartjohnson12 3 days ago
This sounds like a fascinating niche piece of technical expertise I would love to hear more about. What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects?