Remix.run Logo
the_duke 5 days ago

The big problem with CRDTs IMO is that they make it incredibly easy to break application semantics.

Just a basic example for a task tracker:

* first update sets task cancelled_at and cancellation_reason

* second update wants the task to be in progress, so sets started_at

If code just uses the timestamps to consider the task state, it would not assume the task is cancelled, unexpected since the later user update set it to in progress.

Easy fix, we just add a state field 'PENDING|INPROGRESS|CANCELLED|...'.

Okay, but now you have a task that is in progress, but also has a cancellation timestamp, which seems inconsistent.

The point is:

With CRDTs you have to consider how partial out of order merges affect the state, and make sure your logic is always written in a way so these are handled properly. That is *not easy*!

I'd love it if someone came up with a framework that allows defining application semantics on top of CRDTs, and have the framework ensure types remain consistent.

tempodox 5 days ago | parent | next [-]

Do not separate the state field from its time stamp(s). Use a sum type (“tagged union”) where the time stamps are the payload for a selected state. Make invalid states unrepresentable.

shakna 5 days ago | parent | next [-]

If you want invalid states unrepresentable, and time as a primary key... How do you deal with time regularly becoming non-linear within the realm of computing?

josephg 5 days ago | parent | next [-]

The general answer is to accept that time isn’t linear. In a collaborative editing environment, every event happens after some set of other events based on what has been observed locally on that peer. This creates a directed acyclic graph of events (like git).

shakna 5 days ago | parent [-]

That requires a different primary key than the time then, no?

josephg 4 days ago | parent [-]

It requires a different primary key than an autoincrementing integer. One popular choice is to use a tuple of (peer_guid, incrementing integer). Or a randomly generated GUID, or a hash of the associated data.

Then each event is associated with zero or more "parent events".

- An event has 0 parents if it is the first change

- An event has 1 parent if it simply came after that event in sequence

- And if an event merges 2 or more branches in history, it says it comes after all of those events

You can also think about it like a set. If I know about events {A, B, C} and generate event D, then D happens-after {A, B, C}. (Written {A,B,C} -> D). But if A->B, then I only need to explicitly record that {B,C} -> D because the relationship is transitive. A -> B -> D implies A -> D.

shakna 4 days ago | parent [-]

And the moment you need to merge, unrepresentable states become possible.

There are techniques to make it less painful, but it will still be possible.

tempodox 4 days ago | parent [-]

You mean, like attempting to merge contradictory states? You will need some resolution stategy then, but in general that would be application-specific, and sometimes it may not exist.

shakna 4 days ago | parent [-]

Okay... But we're now back to invalid states being possible. Tagging with time isn't enough.

josephg 4 days ago | parent [-]

It isn’t enough for what? What are you trying to do?

There may be a way to solve whatever problem you have, but without specifics it’s impossible to tell.

johnecheck 5 days ago | parent | prev | next [-]

It might be nice if our universe conformed to our intuitions about time steadily marching forward at the same rate everywhere.

Einstein just had to come along and screw everything up.

Causality is the key.

throwawaymaths 5 days ago | parent | prev [-]

logical clocks

Phelinofist 5 days ago | parent [-]

Even with a hybrid logical clock they are not that useful in case of a network partition AND clock drift, no?

the_duke 5 days ago | parent | prev [-]

There are many ways to solve each individual problem.

The point is that you always have to think about merging behaviour for every piece of state.

fauigerzigerk 5 days ago | parent | next [-]

Yes, sort of like you have to think about your transaction boundaries in server-side code for every single task.

The difference is that coming up with a correct CRDT solution for application specific consistency requirements can be a research project. In many cases, no CRDT solution can exist.

josephg 5 days ago | parent [-]

Can you give some examples?

In my experience, 95% of applications are handled just fine by the sort of JSON types built in to Yjs or automerge. The problems I hear people complain about are things like performance, size on disk and library ergonomics. And the long tail of features - like ephemeral data support and binary assets.

But data mapping seems mostly fine?

I know of a couple of exceptions. Arbitrary nested tree reparenting can be a nightmare. And there aren’t many good rich text implementations out there.

What problems are you actually running into?

fauigerzigerk 5 days ago | parent [-]

There are two rather different issues.

One large class of problems I'm thinking of is simply outside the scope of CRDTs. The whole idea of _eventual_ consistency doesn't really work for things like payment systems or booking systems. A lot of OLTP applications have to be consistent at all times (hence the O). Money must not be double spent. Rooms or seats must not be double booked.

The other class of problems is more debatable. CRDTs can guarantee that collaborative text editing results in the same sequence of letters on all nodes. They cannot guarantee that this sequence makes sense. Authors can step on each other's toes.

Whether or not this is a problem depends on the specific workflow and I think it could be mitigated by choosing better units of storage/work (such as paragraphs rather than letters).

josephg 5 days ago | parent | next [-]

> One large class of problems I'm thinking of is simply outside the scope of CRDTs. The whole idea of _eventual_ consistency doesn't really work for things like payment systems or booking systems.

Yes! I think of it as owned data and shared data. Owned data is data that is owned by one process or node. Eg my bank balance, the position of my mouse cursor, the temperature of my CPU. For this stuff, you don’t want a crdt. Use a database. Or a variable in memory or a file on disk. Broadcast updates if you want, but route all write requests through the data’s owner.

Then there’s shared data - like the source code for a project or an apple note. There, CRDTs might make sense - especially if you get branching and merging support along for the ride.

> Authors can step on each other's toes.

Yeah when merging long lived branches, the workflow most people want is what git provides - of humans manually resolving conflicts. There’s no reason a crdt couldn’t provide this. CRDTs have a superset of the information available to git. It’s weird nobody has coded a system like that up yet.

withinboredom 5 days ago | parent | next [-]

I think you have the right idea, but possibly the wrong perspective. You want your _source of truth_, which is the "owned data" to be strongly consistent. Your shared data is a "view of truth" which may be incomplete or in disagreement with the source of truth. For example, the color of the sky "right now" depends on where on the earth you are standing, but we can all agree that air is 'just barely blue' and it depends on the light shining into it and how much of there exists.

The _source of truth_ are these facts (like "the air is blue" or "the user inserted the letter A at position X" or "the CPU is 40 degrees"). The view of this source is what we see, and can be seen through a CRDT or any other lens.

josephg 5 days ago | parent [-]

The way I’m defining it, my shared state is the data we store in a crdt. And CRDTs have strong eventual consistency. That’s what makes them great. So we can have a data structure which shows all users an identical view of the world.

Normally we do that by storing something totally different under the hood. Eg, git actually stores a commit graph. But the system makes a determinism guarantee: we promise that all users who have the same version checked out will see exactly the same thing. At one level, we’re storing “a list of facts” (the commit graph). But at another level of abstraction, we’re just storing application data. It’s just also replicated between many peers. And editable locally without network access.

withinboredom 4 days ago | parent [-]

> So we can have a data structure which shows all users an identical view of the world.

This is never true. You can prove that at some time now()-T where T > 0 you had the same view of the universe, but you cannot prove that you currently have the exact same view because even with the attempt of checking, T becomes greater than 0. Sometimes, this doesn't matter (T can be arbitrarily large and still effectively be zero -- like asking your friend if he is still married to that person. They can answer you days later, and it'll still be true), but sometimes even very small values of T cannot be assumed to be zero.

josephg 4 days ago | parent [-]

Well yeah obviously you never know for sure that a remote peer doesn’t have some changes that they haven’t told you about yet. That’s also true with lots of platforms - like google docs and Notion and multiplayer video games. Seems fine though? I don’t understand why this matters for collaborative editing?

withinboredom 4 days ago | parent [-]

Have you ever worked on the same repo with >500 devs? 99% of the time, it doesn’t matter. People talk to people.

josephg 3 days ago | parent [-]

Yes; but I have no idea how that connects to anything else we’ve been discussing here.

sethev 5 days ago | parent | prev | next [-]

Pijul is a version control system based on a CRDT: https://pijul.org/manual/theory.html#conflicts-and-crdts

It works like you describe, with humans manually resolving conflicts. The conflicts are represented in the data model, so the data model itself converges without conflicts...if that makes sense.

johnecheck 5 days ago | parent | prev | next [-]

Conflict-free is right in the name, layering conflicts on top of it would be blasphemy :p

evelant 5 days ago | parent | prev | next [-]

See my comment below, I prototyped something like this. https://news.ycombinator.com/item?id=45180325

josephg 5 days ago | parent [-]

Interesting idea. As I understand it though, this wouldn’t give you the kind of conflict semantics I’m talking about out of the box. What I want is - if two users concurrently edit the same line of text, the system can “merge” those changes by storing the conflict. Subsequent readers of the document see a merge conflict and can resolve the conflict manually.

Your system looks like it just enforces a global order on the actions. This will give you SEC - but how do you preserve the information that these edits were concurrent - and thus conflict with one another?

evelant 4 days ago | parent [-]

You're right, it's not the same as conflict/merge semantics, but you probably could implement those semantics on top of it. My idea was more about being able to merge offline states for arbitrary data without user intervention while also ensuring that application invariants / semantics are preserved. Preserving app semantics while as much as possible preserving user intentions.

fauigerzigerk 5 days ago | parent | prev [-]

>CRDTs have a superset of the information available to git. It’s weird nobody has coded a system like that up yet.

That's an interesting idea. I have to think about this.

gritzko 5 days ago | parent | prev [-]

The classical paper-ledger bookkeeping is pretty much eventually consistent. They did not have the Internet when they invented it.

Flight booking is often statistically consistent only. Overbooking, etc.

fauigerzigerk 5 days ago | parent [-]

>The classical paper-ledger bookkeeping is pretty much eventually consistent. They did not have the Internet when they invented it.

Absolutely. Bookkeeping is an offline activity (I'm only doing it once a year in my company, ha ha). You just have to make sure not to record the same transaction more than once, which could be non-trivial but shouldn't be impossible to do with CRDTs.

>Flight booking is often statistically consistent only. Overbooking, etc.

That may be acceptable in some cases but you still can't use CRDTs for it, because you need a way to limit the extent of overbooking. That requires a centralised count of bookings.

josephg 5 days ago | parent [-]

Most complex crdts are built on top of the simple crdt of a grow only set. Ie, what we actually synchronise over the network is a big bag of commits / operations / something such that the network protocol makes sure everyone ends up with all of the operations known to any peer. Then the crdt takes that big set and produces some sort of sensible projection from it.

> You just have to make sure not to record the same transaction more than once

So this should be pretty easy. Have a grow only set of transactions. Give each one a globally unique ID at the point of creation. Order by date and do bookkeeping. One thing you can’t guarantee is that the balance is always positive. But otherwise - yeah.

naasking 5 days ago | parent | prev [-]

> The point is that you always have to think about merging behaviour for every piece of state.

CRDTs can't eliminate the requirement to think about what the consistent states are.

evelant 5 days ago | parent | prev | next [-]

I prototyped exactly such a framework! It's designed to solve exactly the problem you mentioned. It’s a super interesting problem. https://github.com/evelant/synchrotron

The gist is:

* Replicating intentions (actions, immutable function call definitions that advance state) instead of just replicating state.

* Hybrid logical clocks for total ordering.

* Some client side db magic to make action functions deterministic.

This ensures application semantics are always preserved with no special conflict resolution considerations while still having strong eventual consistency. Check out the readme for more info. I haven’t gotten to take it much further beyond an experiment but the approach seems promising.

the_duke 5 days ago | parent | next [-]

Nice, will have a look!

I've had similar thoughts, but my concern was: if you have idempotent actions, then why not just encode them as actions in a log. Which just brings you to event sourcing, a quite well-known pattern.

If you go that route, then what do you need CRDTs for?

ForHackernews 5 days ago | parent | next [-]

Doesn't event-sourcing imply that there's a single source-of-truth data store you can source them from? I'm not sure event sourcing says anything about resolving conflicts or consistency.

evelant 5 days ago | parent | prev | next [-]

The pattern I came up with is similar to event sourcing but with some CRDT and offline-first concepts mixed in. By using logical clocks and a client side postgres (pglite) it doesn't have to keep the entire event history for all time and the server side doesn't have to process actions/events at all beyond storing them. The clients do the resolution of state, not the server. Clients can operate offline as long as they like and the system still arrives at a consistent state. AFAIK this is different than most event sourcing patterns.

At least in my thinking/prototyping on the problem so far I think this solution offers some unique properties. It lets clients operate offline as long as they like. It delegates the heavy lifting of resolving state from actions/events to clients, requiring minimal server logic. It prevents unbounded growth of action logs by doing a sort of "rebase" for clients beyond a cutoff. It seems to me like it maximally preserves intentions without requiring specific conflict resolution logic. IMO worth exploring further.

n0w 5 days ago | parent | prev [-]

A CRDT is any data structure that meets the definition (associative, commutative, idempotent, etc...)

Event Sourcing is not strictly designed to achieve eventual consistency in the face of concurrent writes though. But that doesn't mean it can't be!

I've also been considering an intent based CRDT system for a while now (looking forward to checking out GPs link) and agree that it looks/sounds very much like Event Sourcing. It's worth while being clear on the definition/difference between the two though!

SkiFire13 5 days ago | parent | prev [-]

I wonder how does this handle a modify-rename conflict? e.g. there's a file identified by its name `a` and one client renames it to `b` while another client tries to modify the contents of `a`. Once you replay it in this order does the intent of modifying the contents of what was once `a` remain?

I know you can use some unique persistent ids instead of names, but then you get into issues that two clients create two files with the same name: do you allow both or not? What if they initially create it equal? What if they do so but then they modify it to be different?

evelant 3 days ago | parent [-]

It would be up to application logic. This prototype essentially offers the same behavior you would get with a traditional backend API except it works offline. The results would be the same as if clients made those calls to a backend api, that is, up to application logic. My idea was that it's essentially impossible to have generic "conflict resolution" that follows arbitrary business rules so I made the business rules _be_ the conflict resolution. For any given situation the answer to "how would it handle a then b then c" is "the same as any normal backend api, per regular business logic, except it works offline".

littlecosmic 5 days ago | parent | prev | next [-]

Don’t you also have to consider this just as much without CRDT? Not saying it isn’t a real issue, but this example could easily be a problem with a more traditional style app - maybe users open the record on their web browser at same time and make different updates, or they update the different timestamp fields directly in a list of tasks.

the_duke 5 days ago | parent [-]

Sure, but you can usually rely on database transactions to handle the hard part.

filleokus 5 days ago | parent | prev [-]

Yes!

Any many CRDT implantations have already solved this for the styled text domain (e.g bold and cursive can be additive but color not etc).

But something user definable would be really useful