koverstreet 2 hours ago

I think their claims of write amplification reduction are a bit overstated given more realistic workloads.

It is true that b-trees aren't ideal in that respect, and you will see some amount of write amplification, but in my experience not enough that it should be a major consideration.

You really have to take into account working set size and cache size to make any judgements there; your b-tree writes should be driven by journal/WAL reclaim, which will buffer up updates.

A purely random update workload will kill a conventional b-tree on write amplification - like I mentioned, that's the absolute worst case scenario for a b-tree. But it just doesn't happen in the real world.
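
To make the working set and batching argument concrete, here is a toy model, not a measurement of any real system. It assumes updates are buffered `batch` at a time (a stand-in for journal/WAL reclaim) and that every dirty node is rewritten in full at each flush. The function name and all parameters are hypothetical, chosen only for illustration.

```python
import random

def write_amplification(keys, keys_per_node=1024, key_bytes=256, batch=1):
    """Bytes written to disk divided by bytes logically updated.

    Toy model (an assumption, not how any particular b-tree works):
    updates are flushed `batch` at a time, and each flush rewrites
    every node touched by that batch in full.
    """
    node_bytes = keys_per_node * key_bytes
    written = 0
    for start in range(0, len(keys), batch):
        # Nodes dirtied by this batch; key k lives in node k // keys_per_node.
        dirty_nodes = {k // keys_per_node for k in keys[start:start + batch]}
        written += len(dirty_nodes) * node_bytes
    return written / (len(keys) * key_bytes)

sequential = list(range(4096))
rng = random.Random(0)
huge_keyspace = [rng.randrange(10**9) for _ in range(4096)]    # random, cold
small_working_set = [rng.randrange(4096) for _ in range(4096)]  # random, hot

print(write_amplification(sequential, batch=1024))         # 1.0
print(write_amplification(sequential, batch=1))            # 1024.0
print(write_amplification(huge_keyspace, batch=1024))      # close to the worst case
print(write_amplification(small_working_set, batch=1024))  # small: only 4 nodes exist
```

In this model, random updates over a huge keyspace stay near the worst case even with batching, while the same random pattern over a working set of a few nodes batches up almost as well as sequential writes.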

For the data I can give you, that would be bcachefs's hybrid b-tree - large btree nodes (256k, typically) which are internally log structured; I would consider it a minor variation on a classical b-tree. The log structuring means that we can incrementally write only the dirty keys in a node, at the cost of some compaction overhead (drastically less than a conventional LSM).
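
As a rough illustration of "internally log structured", the sketch below appends only the dirty keys on each write and rewrites the whole node only when the appended runs no longer fit. This is a simplified guess at the general idea, not the bcachefs implementation: the class, its fields, and the 64-byte entry size are all hypothetical, and node splitting is ignored.

```python
class LogStructuredNode:
    """Toy b-tree node whose on-disk image is a series of appended key runs."""

    def __init__(self, capacity=256 * 1024, entry_bytes=64):
        self.capacity = capacity        # node size in bytes (256k as in the comment)
        self.entry_bytes = entry_bytes  # assumed fixed size per key/value entry
        self.on_disk = {}               # latest persisted value for each key
        self.dirty = {}                 # keys updated since the last write
        self.used = 0                   # bytes of the node consumed by appended runs
        self.bytes_written = 0          # total bytes this node has sent to disk

    def update(self, key, value):
        self.dirty[key] = value

    def write(self):
        """Append only the dirty keys; compact if they would not fit."""
        need = len(self.dirty) * self.entry_bytes
        if self.used + need > self.capacity:
            self.compact()
        else:
            self.on_disk.update(self.dirty)
            self.used += need
            self.bytes_written += need
        self.dirty.clear()

    def compact(self):
        """Rewrite the node as one run, dropping overwritten entries."""
        self.on_disk = {**self.on_disk, **self.dirty}
        self.used = len(self.on_disk) * self.entry_bytes
        self.bytes_written += self.used

node = LogStructuredNode()
node.update("a", 1)
node.write()
print(node.bytes_written)  # 64, versus 262144 for rewriting the full node
```

The compaction cost only shows up once the node fills with stale entries, which is the trade-off the comment describes.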

In actual real world usage, when I've looked at the numbers (not recently, so this may have changed) we're always able to do giant highly efficient b-tree writes - the journal and in-memory cache are batching things up as much as we want - which means write amplification is negligible.

fuzzybear3965 2 hours ago | parent [-]

Of course mileage may vary with different workloads, but are there any good benchmarks/suites to use for comparison in cases like these? They used YCSB, but I don't know whether those workloads ([1]) are relevant to modern/typical access patterns, or whether they're applicable to SQL databases.

You thinking about running some benchmarks in a bcachefs branch (:pray:)?

I want to see this data structure prototyped in PostgreSQL.

[1]: https://github.com/brianfrankcooper/YCSB/tree/master/workloa...

koverstreet an hour ago | parent [-]

I've got microbenchmarks for the bcachefs btree here: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/...

They're ancient, and I only have pure random and sequential benchmarks - no zipf distribution, which really should be included.
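
For anyone wanting to add the missing zipf case, a key generator can be sketched in a few lines. This is a standalone illustration, not part of the bcachefs benchmarks; `zipf_keys` and its parameters are hypothetical, and the exponent `s=1.0` is just a common default.

```python
import itertools
import random

def zipf_keys(n_keys, n_samples, s=1.0, seed=0):
    """Sample key indices in [0, n_keys) with P(rank r) proportional to 1 / r**s.

    Key 0 is the hottest key, key 1 the next hottest, and so on.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    cum_weights = list(itertools.accumulate(weights))
    return rng.choices(range(n_keys), cum_weights=cum_weights, k=n_samples)

keys = zipf_keys(n_keys=1000, n_samples=10000)
print(keys.count(0), keys.count(999))  # the hottest key dominates the coldest
```

A real benchmark would usually scatter the hot keys across the keyspace (for example by hashing the rank) so that popularity does not coincide with key order.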

Feel free to play around with them if you want :) I could even find the driver code, if you want.

I've always been curious about PostgreSQL's core b-tree implementation. I ran into a PostgreSQL developer at a conference once, and exchanged a few words that as I recall were enough to get me intrigued, but never learned anything about it.

In a system as big, complex and well optimized as either bcachefs or postgres, the core index implementation is no longer the main consideration - there are layers and layers, and the stuff that's fun to optimize and write papers about eventually gets buried (and you start thinking a lot more about how to lay out your data structures and less about optimizing the data structures themselves).

But you know in something like that there's going to be some clever tricks, that few people know about or even remember anymore :)