Remix.run Logo
Lichtso 4 days ago

Another similar data structure which has a balanced tree (instead of a list) that references array segments is the https://en.wikipedia.org/wiki/Rope_(data_structure)

Its main advantages are the O(log n) time complexity for all size changes at any index, meaning you can efficiently insert and delete anywhere, and it is easy to implement copy-on-write version control on top of it.

monkeyelite 4 days ago | parent [-]

A lot of standard libraries have tried to implement ropes for things like strings and it usually is reverted for a simpler structure.

josephg 4 days ago | parent [-]

There's a good reason for that. Almost all strings ever created in programs are either very small, immutable or append-only. Eg, text labels in a user interface, body of a downloaded HTTP request or a templated HTML string, respectively. For these use cases, small string optimisations and resizable vecs are better choices. They're simpler and faster for the operations you actually care about.

The only time I've ever wanted ropes is in text editing - either in an editor or in a CRDT library. They're a good choice for text editing because they let users type anywhere in a document. But that comes at a cost: Rope implementations are very complex (skip lists have similar complexity to a b-tree) and they can be quite memory inefficient too, depending on how they're implemented. They're a bad choice for small strings, immutable strings and append only strings - which as I said, are the most common string types.

Ropes are amazing when you need them. But they don't improve the performance of the average string, or the average program.

monkeyelite 3 days ago | parent [-]

Yes. But also overwhelming consensus is that complex indirect data structures just don’t end up performing on modern hardware due to cache and branch prediction.

Only use them when the theoretical algorithmic properties make them the only tool for the job.

OskarS 3 days ago | parent | next [-]

They have their place. Certainly B-tree data-structures are tremendously useful and usually reasonably cache friendly. And if std::deque weren't busted on MSVC, there are times where it would be very useful. Linked lists have their place as well; a classic example would be an LRU cache, which is usually implemented as a hash table interleaved with a doubly linked list.

But yeah. Contiguous dynamic arrays and hash tables, those are usually what you want.

josephg 3 days ago | parent | prev [-]

Thats not at all true though?

If you have a small dataset, yeah, memcpy will outperform a lot of indirect pointer lookups. But that doesn't stay true once you're memcpying around megabytes of data. The trick with indirect data structures on modern hardware is to tune the size of internal nodes and leaf nodes to make the cache misses worth it. For example, Binary trees are insanely inefficient on modern hardware because the internal nodes have a size of 2. If you give them a size of 64 or something, they perform much better. (Ie, make a b-tree). Likewise, a lot of bad tree implementations put just a single item in the leaf nodes. Its much better to have leaves store blocks of 128 items or something. And use memcpy to move data around within the block when needed.

This gets you the best of both worlds.

I spent about 18 months optimising a text based CRDT library (diamond types). We published a paper on it. By default, we store the editing history on disk. When you open a document, we reconstruct the document from scratch from a series of edits. After awhile, actually applying the stream of edits to a text document became the largest performance cost. Ropes were hugely useful. There's a stack of optimisations we made there to make that replay another 10x faster or so on top of most rope implementations. Using a linear data structure? Forget it. For nontrivial workloads, you 100% want indirect data structures. But you've gotta tune them for modern hardware.

monkeyelite 2 days ago | parent [-]

My comment is an observation about how this gets tried every few years in major libraries and is usually reverted. I Agree, there are use cases where these are better. But the pattern tends to be to revert to simpler data structures.