Zero downtime migrations at Petabyte scale

We need more details on 6. This is the hard part: you swap connections from A to B, but if B is not synced properly and you write to it, the two start to diverge and there is no way back.

Say B is slightly out of date (replication-wise) and the service modifies something on it; then a change arrives from A that modifies the same data you just wrote.

How do you ensure that B is up to date without stopping writes to A (no downtime)?
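One common answer to this kind of question is to hold (not reject) incoming writes for a brief moment, let replication drain to zero lag, flip routing, and then release the held writes against B. The toy model below is only a sketch of that idea under stated assumptions: `Store`, `Migration`, and every method name are hypothetical, the "databases" are in-memory dicts, and nothing here is taken from the blog post or from any real product's API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy key-value database with an append-only change log (hypothetical)."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


@dataclass
class Migration:
    source: Store                 # "A" in the question above
    target: Store                 # "B" in the question above
    applied: int = 0              # number of source log entries applied to target
    active: str = "source"        # side that currently serves application writes
    buffering: bool = False
    buffered: list = field(default_factory=list)

    def replicate(self, max_entries=None):
        """Apply pending source log entries to the target, in log order."""
        pending = self.source.log[self.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.target.data[key] = value
        self.applied += len(pending)

    def lag(self):
        """Number of source changes the target has not applied yet."""
        return len(self.source.log) - self.applied

    def write(self, key, value):
        """Application entry point: route the write, or hold it during cutover."""
        if self.buffering:
            self.buffered.append((key, value))
        elif self.active == "source":
            self.source.write(key, value)
        else:
            self.target.write(key, value)

    def begin_cutover(self):
        # Writes are delayed, not failed; the source stops changing from here on.
        self.buffering = True

    def finish_cutover(self):
        self.replicate()              # drain whatever lag remains
        assert self.lag() == 0        # target has seen every source change
        self.active = "target"        # flip routing
        self.buffering = False
        held, self.buffered = self.buffered, []
        for key, value in held:       # release held writes in arrival order
            self.target.write(key, value)
```

In this sketch the stale-write problem cannot occur because B only starts taking writes after it has applied every change from A, and the application sees a short pause rather than an error. A real system would also need a timeout on the pause and a way to roll back.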

▲

mattlord 3 hours ago | parent | prev | next [-]

Blog post author here. I'm happy to answer any related questions you may have.

▲

redwood 3 hours ago | parent [-]

That 400TB in the image is a large database! I'm guessing that's not the largest in the PlanetScale fleet either. Very impressive and a reminder that you're strongly differentiated against some of the recent database upstarts in terms of battle tested mission critical scale. Out of curiosity how many of these large clusters are using your true managed 'as a service' offering or are they mostly in the bring your own cloud mode? Do you offer zero downtime migrations from bring your own cloud to true as a service?

▲

mattlord 2 hours ago | parent [-]

That particular cluster has grown significantly since the post was written, and yes there are now quite a few others that are challenging it for the "largest" claim. :-)

These larger ones are fully using the PlanetScale SaaS, but they are using Managed -- meaning that there are resources dedicated to and owned by them. You can read more about that here: https://planetscale.com/docs/vitess/managed

All of the PlanetScale features, including imports and online schema migrations or deployment requests (https://planetscale.com/docs/vitess/schema-changes/deploy-re...), are fully supported with PlanetScale Managed.

▲

redwood 2 hours ago | parent [-]

Understood: that's great for your customers' EDP negotiations with their cloud providers!

▲

WaitWaitWha 2 hours ago | parent | prev | next [-]

I split step 4 in their "high level, this is the general flow for data migrations".

4.0 Freeze old system

4.1 Cut over application traffic to the new system.

4.2 Merge any diff that accumulated between the snapshot (1.) and the cutover (4.1)

4.3 Go live

To me, the above reduces the pressure on downtime because the merge between freeze and go-live is significantly smaller than trying to go live with the entire environment. If timed well, the diff could be minuscule.
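A minimal sketch of that split flow, assuming the old and new systems can be modeled as plain dicts: take a snapshot, bulk-copy it, let writes continue, then freeze and merge only the diff. The helper names (`snapshot`, `diff_since`, `apply_diff`) and the sample keys are made up for illustration.

```python
_MISSING = object()  # sentinel so a stored value of None is not mistaken for "absent"


def snapshot(source):
    """Step 1: capture a point-in-time copy of the old system."""
    return dict(source)


def diff_since(snap, current):
    """Return (upserts, deletes) that turn `snap` into `current`."""
    upserts = {k: v for k, v in current.items() if snap.get(k, _MISSING) != v}
    deletes = set(snap) - set(current)
    return upserts, deletes


def apply_diff(target, upserts, deletes):
    """Step 4.2: merge the small diff into the new system."""
    target.update(upserts)
    for key in deletes:
        target.pop(key, None)


old = {"u1": "alice", "u2": "bob", "u3": "carol"}
snap = snapshot(old)
new = dict(snap)              # bulk copy runs while the old system stays live

old["u2"] = "bobby"           # writes keep landing on the old system
del old["u3"]
old["u4"] = "dave"

# 4.0 freeze: no more writes to `old` from here on
upserts, deletes = diff_since(snap, old)
apply_diff(new, upserts, deletes)   # 4.2 merge, then 4.3 go live
```

The downtime here is the time between the freeze and the end of the merge, so it scales with the size of the diff rather than the size of the dataset. Comparing full copies like this is only practical in a toy; a real system would more likely read the diff from a change log.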

What they are describing is basically live-mirroring the resource. Okay, that is fancy and nice. I'd love to be able to do that. Some of us have a mildly chewed piece of bubble gum, a foot of duct tape, and a shoestring.

▲

dheera 23 minutes ago | parent [-]

Yeah, it depends on what the system is. Lots of systems can tolerate a lot more downtime than the armchair VPs want them to have. If people can't access Instagram for 6 hours, the world won't end. Gmail or AWS S3 is a different story. Therefore Instagram should give their engineers a break and permit a migration with downtime. It makes the job a lot easier, requires fewer engineers and less money, and is much less likely to have bugs.

▲

ksec 2 hours ago | parent | prev | next [-]

Missing 2024 in the Title.

▲

redwood 3 hours ago | parent | prev [-]

Worth underlining that this is about data migrations from one database server or system to another, rather than schema migrations.