A couple dumb questions:

1. You say "ephemeral", but my understanding is that NVMe is non-volatile so upon crash and restart we should be able to recover the state of the memory. Is is ephemeral because of how EC2 works where you might not get that same physical box and memory addresses back?

2. Can you explain what "Semi-synchronous replication" is? Your docs say "This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client." but I would call that synchronous since the write is blocked until it is replicated.

Thanks!

▲

maxenglander 4 days ago | parent | next [-]

Hi n_u,

When we say ephemeral we mean that if the host compute dies in a permanent way (which happens from time to time) the data on the NVMe drives attached to that host is not recoverable by us. AWS/GCP might have recovery mechanisms internally it, but we don't have API access to those APIs.

When we say "semi-synchronous replication" we mean it in the sense of MySQL semi-synchronous replication: https://dev.mysql.com/doc/refman/8.4/en/replication-semisync.... To be honest I'm not exactly sure where the "semi" comes from but here are two possible reasons I can think of why:

1. We actually only require that 1 of the 2 replicas sends an acknowledgement to the primary that it has durably stored the transaction to its relay log before the primary in turn sends an acknowledgement back to the client application. 2. The transaction is visible (can be read) on both the primary and the replica _before_ the primary sends back an acknowledgement that the transaction was committed back to the client application.

	▲	n_u 4 days ago \| parent [-]
		Thanks! I see. It's maybe a term they came up with to place it between async and fully synchronous replication.

▲

rcrowley 4 days ago | parent | prev [-]

I think we've got (1) covered elsewhere in the comment tree.

For (2), semi-synchronous replication is a MySQL term which we realize in Postgres is by using synchronous replication with ANY one of the available replicas acknowledging the write. This allows us to guarantee durability in two availability zones before acknowledging writes to clients.

In MySQL the _semi_ part of semi-synchronous replication refers to the write only needing to be written to the binary log on the replica and not (necessarily) applied to InnoDB. This is why a MySQL database might be both acknowledging semi-synchronous writes and reporting non-zero replication lag.

▲

n_u 4 days ago | parent [-]

> write only needing to be written to the binary log on the replica and not (necessarily) applied to InnoDB.

Ah. I wonder are writes in the log but not yet in InnoDB are available for reads? Then your write may succeed but a subsequent read from a replica would not see it so you lose read-after-write consistency. Perhaps that's another tradeoff.

I'll have to research a bit more but the MySQL docs [1] say "requires only an acknowledgement from the replicas, not that the events have been fully executed and committed on the replica side" which implies that it can't be read yet.

Thanks!

[1] https://dev.mysql.com/doc/refman/8.4/en/replication-semisync...

▲

rcrowley 4 days ago | parent [-]

A lagging replica, even one that just acknowledged a semi-sync write, will return stale results if you route a `SELECT` to it.

First and foremost, the extra copies of the data are for fault tolerance. In specific circumstances they may offer some slack capacity that you can use to serve (potentially stale) reads but they're never going to offer read-your-writes consistency.

The docs you quote are a bit obtuse but the "acknowledgement" is the row event being written to the binary log. "Fully executed and committed" is when it makes its way into InnoDB and becomes available for future reads.

	▲	n_u 4 days ago \| parent [-]
		Ah yeah that makes sense. I guess if you are only using replica for fault tolerance and not as a read-replica then it seems semi-synchronous replication is strictly better than synchronous replication. I suppose in semi-sync the failover will take a little longer but it's probably inconsequential. Thanks!