▲ | stuartjohnson12 3 days ago | ||||||||||||||||
This sounds like a fascinating niche piece of technical expertise I would love to hear more about. What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects? | |||||||||||||||||
▲ | jandrewrogers 3 days ago | parent | next [-] | ||||||||||||||||
It is dependent on the intended workload but there are a few common design problems. Keep in mind that you can't just deal in the average case, you have to design for the worst possible cases of extremely skewed or pathologically biased distributions. A lot of the design work is proving worst case resource bounds under various scenarios and then proving the worst case behavior of designs intended to mitigate that. An obvious one is bulk deletion, which is rarely fast at any scale. This may involve trillions of updates to search indexing structures, which in naive implementations could look like pointer-chasing across disk. Releasing storage to allocators has no locality because you are streaming the allocations to release off that storage in semi-random order. It is unhelpfully resistant to most scheduling-based locality optimization techniques. You also want to parallelize this as much as possible and some of these allocators will be global-ish. The most interesting challenge to me is meta-scheduling. Cache replacement algorithms usually don't provide I/O locality at this scale so standard mitigations for cache-resistant workloads like dynamic schedule rewriting and latency-hiding are used instead. Schedulers are in the center of the hot path so you really want these to be memory-resident and fast. Their state size is loosely correlated with the number of objects, so in some extreme cases these can easily exceed available memory on large servers. You can address this by designing a "meta-scheduler" that adaptively optimizes the scheduling of scheduler state, so that the right bits are memory-resident at the right time so that the scheduler can optimally schedule its workload. It is difficult to overstate how much of a breaking change to conventional architecture this turns out to be. These add some value even if the state is memory resident but they greatly increase design complexity and make tail latencies more difficult to manage. A more basic challenge is that you start dealing with numbers that may not be representable in 64-bits. Similarly, many popular probabilistic algorithms may not offer what you need when the number of entities is this large. I aggressively skirted these issues for a long time before relenting. I deal more with database storage engines than filesystems, but to a first approximation "files" and "shards" are equivalent for these purposes. | |||||||||||||||||
| |||||||||||||||||
▲ | KaiserPro 3 days ago | parent | prev [-] | ||||||||||||||||
you really notice metadata performance (try a git checkout on EFS on AWS. loads of small files takes fucking ages) However EFS is actually pretty fast. you can get decent throughput if you're writing to just one file. but if you're trying to open 1000 1meg files to read from vs 1 1G file, it'll be much slower (unless they'd dramatically improved performance recently) Trying to have a fast globally consistent database for quadrillion items in the _same_ name space is super hard. You need to chose a tradeoff between speed, partition resistance and consistency. You're much better off sharding into discreet logical units. Its very rare that you need a global namespace for a filesystem. For VFX where we used lustre a lot, the large namespace was a nice to have, it was more about getting a raid-0 across file servers (well object stores) to get performance. For filesystems specifically, if you're using folders, then you don't actually need to guarantee much outside of a folder. So long as filenames are unique to that folder, you can get away with a lot of shit you can't do in a normal database. you also don't need directories to be on the same filesystem (well in linux at least) so you can also shard by using directories as a key. The directory-key-filesystem approach is actually hilariously simple, fast scalable and reliable. If a single server/Fs goes down it only takes out that area. On the downside it does mean that you can overwhelm/get hot spots. | |||||||||||||||||
|