sciurus 7 hours ago

Although this is oversimplifying things [0], in the face of partitions ZooKeeper emphasizes consistency over availability.

[0] https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

jiggawatts 2 hours ago

The problem with that framing is that stopping and restarting all nodes is not a partition!

A partition is when some nodes can’t reach other nodes.

ZooKeeper instead has an issue where it does try to restart, but the timeout (why?!) is too short, something like 30 seconds. If a majority of your nodes don’t all start within that time window, the whole cluster stays down until someone manually intervenes.

I discovered this fun feature when keeping non-prod systems off to save money in the cloud.

It also bites when making certain big-bang changes in production.
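
To make the start-window problem concrete, here is a toy model in Python. It is not ZooKeeper's actual election code: the function name `quorum_forms`, the "each node gives up after `window` seconds" rule, and the 30-second figure are all assumptions taken from the rough description above.

```python
def quorum_forms(start_times, window):
    """Toy model: can a majority of nodes be up at the same moment?

    Assumption (not real ZooKeeper behaviour): a node that starts at time t
    waits for peers during [t, t + window) and then gives up for good.
    """
    majority = len(start_times) // 2 + 1
    starts = sorted(start_times)
    # A majority overlaps only if some run of `majority` consecutive
    # start times fits inside a single window.
    for i in range(len(starts) - majority + 1):
        if starts[i + majority - 1] - starts[i] < window:
            return True
    return False


# Three nodes booted 10 s apart: two of them overlap, so a quorum forms.
print(quorum_forms([0, 10, 20], window=30))
# Three nodes booted 45 s apart: no two are ever waiting at the same time.
print(quorum_forms([0, 45, 90], window=30))
```

Note that no node is unreachable in the second case. They just never overlap, which is why this is a startup-timing failure and not a partition.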