jakewins 3 days ago

I used to say this as well, but the industry has for a long time now equated “durable” with “stored on disk”. Any DBA will assume that’s what it means, and will rely on that when working out the replication they need, whether in clustering or in RAID.

If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.

zbentley 3 days ago | parent | next [-]

I forget the product, but more than a decade ago I remember someone broke out their durability into a table, with columns for all the settings their data store offered, from “RAM on one node” to “fsync confirmed on a quorum of nodes’ disks”, and rows for example failure cases, ranging from “unexpected reboot of one machine” to “catastrophic loss of quorum-1 machines”. Cells were data loss risks, from “prevented” to “possible” to “likely”.

That was very helpful when choosing durability levels.
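
For anyone who wants to build that kind of table for their own system, here is a minimal sketch. It only includes the two endpoint settings and the two endpoint failure cases named above, and the cell values are my own illustrative assumptions, not the table from the product I'm remembering. The names `SETTINGS`, `FAILURES`, `RISK`, `risk` and `render` are all hypothetical.

```python
# Illustrative durability matrix. Cell values are assumptions for the sake of
# the example, not data from any real product's documentation.
SETTINGS = ["ram_one_node", "fsync_quorum"]
FAILURES = ["reboot_one_machine", "loss_of_quorum_minus_one_machines"]

# (failure case, durability setting) -> data loss risk
RISK = {
    ("reboot_one_machine", "ram_one_node"): "likely",
    # At least one fsynced copy survives a single reboot.
    ("reboot_one_machine", "fsync_quorum"): "prevented",
    ("loss_of_quorum_minus_one_machines", "ram_one_node"): "likely",
    # Losing quorum-1 machines still leaves one fsynced copy (assumed intact).
    ("loss_of_quorum_minus_one_machines", "fsync_quorum"): "prevented",
}


def risk(failure, setting):
    """Look up the data loss risk for one failure case under one setting."""
    return RISK[(failure, setting)]


def render():
    """Render the matrix as text: one column per setting, one row per failure."""
    width = max(len(f) for f in FAILURES)
    lines = [" " * width + " | " + " | ".join(SETTINGS)]
    for failure in FAILURES:
        cells = [risk(failure, s).ljust(len(s)) for s in SETTINGS]
        lines.append(failure.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render())
```

A real version would have more columns in between (for example, acknowledged in RAM on a quorum, or fsync on one node only), which is where the “possible” cells tend to show up.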

klodolph 3 days ago | parent | prev [-]

I don’t have any respect for the viewpoint that “durable” is equatable with “stored on disk”, and I don’t want to spend time accommodating that viewpoint. It is just an oversimplification in a very bad way.

AFRs and discussions about different failure scenarios are the bare minimum. The bare minimum for scenarios is disk loss, total machine loss, and data center loss. This is just my take on things. I don’t care if something is on disk or not. I do care what happens when a sector on disk goes bad, when a faulty power supply destroys all the disks in a machine, or when a data center floods.

That forces you to think about things like whether you want to turn on synchronous replication.
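
To make the kind of calculation I mean concrete, here is a rough sketch. It assumes replica disk failures are independent, ignores repair and rebuild windows, and folds every correlated event (bad power supply, flooded data center) into a single hypothetical `site_loss_rate`. The function name and the numbers in the usage lines are made up for illustration.

```python
def annual_loss_probability(disk_afr, replicas, site_loss_rate=0.0):
    """Rough annual probability of losing every copy of the data.

    Assumptions (all simplifications):
      * each replica's disk fails independently with probability disk_afr per year
      * failed replicas are never rebuilt during the year
      * site_loss_rate is the annual probability of one correlated event
        that destroys all replicas at once
    """
    all_disks_fail = disk_afr ** replicas
    # Data survives only if neither the independent nor the correlated case hits.
    return 1 - (1 - all_disks_fail) * (1 - site_loss_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 2% disk AFR, three replicas.
    print(annual_loss_probability(0.02, 3))
    # Same replicas, all in one place with a 0.1% yearly chance of losing the site.
    print(annual_loss_probability(0.02, 3, site_loss_rate=0.001))
```

With these made-up inputs the correlated term dominates the result, which is why the machine-loss and data-center-loss scenarios matter more than whether any single copy is on disk.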

jakewins 2 days ago | parent [-]

The point of “durable” implying stored to durable media is precisely that it allows the operator of the system to make that kind of calculation. They know the disks they picked and the replication chosen, and as long as the database calls fsync, their calculations will work.
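
For reference, “calls fsync” means something like the following minimal sketch in Python. `durable_append` is a hypothetical helper name. It does not fsync the containing directory, which a real database would also need to do when creating a new file, and what fsync actually guarantees still depends on the OS, filesystem, and drive write cache.

```python
import os


def durable_append(path, data):
    """Append bytes to a file and return only after fsync has completed.

    Sketch only: a real storage engine would also fsync the parent directory
    after creating the file and handle partial writes and I/O errors.
    """
    with open(path, "ab") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
```

A system that acknowledges the write before the `os.fsync` call returns is the one skipping the step I'm talking about.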

My beef is with database systems that use the argument you made further up thread to skip fsync to juice their performance numbers. Data is not “durable” if turning off the machines storing it means it’s lost. That’s a category difference, not a pure probability difference as you are claiming.

It is of course totally fine to not store data to durable media and say the risk of devops doing a coordinated reboot is as low as the risk of RAID disk data loss, but then don’t use the word “durable”.

klodolph 6 hours ago | parent [-]

That definition of durable doesn’t seem useful to me, sorry. I want the failure rates and scenarios.