ndr 20 hours ago

I found it very lacking in how to do CD with no downtime.

It requires a particular dance if you ever want to add/delete a field and make sure both new-code and old-code work with both new-schema and old-schema.

The workaround I found was to run tests with new-schema+old-code in CI whenever I have schema changes, and then `migrate` before deploying new-code.

Are there better patterns beyond "oh you can just be careful"?

rorylaitila 19 hours ago | parent | next [-]

I simplify it this way: I don't delete fields or tables in migrations once an app is in production; I only clean them up manually after no production version can possibly use them. I treat the database schema as if it were append-only: only add new fields. This means you always roll a database forward; rollback migrations are 'not a thing' to me. I also don't rename physical columns in production. If an old field and a new field that represent the same datum need to exist simultaneously, a trigger keeps them in sync.
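The trigger approach can be sketched concretely. This is a minimal illustration, with SQLite standing in for the production database (a Postgres trigger would use a function plus `CREATE TRIGGER`, but the idea is the same); the `customer`, `name`, and `full_name` identifiers are invented for the example:

```python
import sqlite3

# Old code writes the legacy column `name`; new code uses `full_name`.
# A trigger keeps the two in sync so both versions see consistent data.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)
con.executescript("""
CREATE TRIGGER customer_sync_insert AFTER INSERT ON customer
BEGIN
    UPDATE customer
    SET full_name = COALESCE(NEW.full_name, NEW.name),
        name      = COALESCE(NEW.name, NEW.full_name)
    WHERE id = NEW.id;
END;
""")

# Old code writes only the old column...
con.execute("INSERT INTO customer (name) VALUES ('Ada')")
# ...but new code can already read the new one.
row = con.execute("SELECT full_name FROM customer").fetchone()
print(row[0])  # -> Ada
```

A production version would need an `AFTER UPDATE` trigger as well, and recursion guards depending on the database.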

rtpg 3 hours ago | parent | prev | next [-]

Here's a checklist I wrote way back.

https://rtpg.co/2021/06/07/changes-checklist.html

I've been meaning to write an interactive version to sort of "prove" that you really can't do much better than this, at least in general cases.

tmarice 17 hours ago | parent | prev | next [-]

This is not specific to Django; it applies to any project using a database. Here are some quite useful resources I used when we had to address this:

* https://github.com/tbicr/django-pg-zero-downtime-migrations

* https://docs.gitlab.com/development/migration_style_guide/

* https://pankrat.github.io/2015/django-migrations-without-dow...

* https://www.caktusgroup.com/blog/2021/05/25/django-migration...

* https://openedx.atlassian.net/wiki/spaces/AC/pages/23003228/...

Generally it's also advisable to set a statement timeout for migrations, otherwise you can end up with unintended downtime: ALTER TABLE operations very often require an ACCESS EXCLUSIVE lock, and if the table you're migrating already has, say, a very long SELECT from a background task running against it, all other SELECTs will queue up behind the migration and cause request timeouts.

In some cases you can work around this by manually composing operations that require less strict locks, but in our case it was much simpler to just make sure all Celery workers were stopped during migrations.
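As a framework-free sketch of the timeout idea (the helper name and the timeout values here are made up, not recommendations): run `SET lock_timeout` and `SET statement_timeout` on the same session before the risky DDL, so the migration fails fast instead of queueing every other query behind its lock:

```python
def guard_ddl(ddl: str, lock_timeout: str = "5s",
              statement_timeout: str = "30min") -> str:
    """Prefix a DDL statement with Postgres session timeouts so that a
    migration that cannot promptly acquire its ACCESS EXCLUSIVE lock
    aborts quickly rather than blocking all other queries."""
    if not ddl.endswith(";"):
        ddl += ";"
    return "\n".join([
        f"SET lock_timeout = '{lock_timeout}';",
        f"SET statement_timeout = '{statement_timeout}';",
        ddl,
    ])

script = guard_ddl("ALTER TABLE app_order ADD COLUMN note text")
print(script)
```

If the lock can't be acquired within the timeout, the migration errors out and can simply be retried at a quieter moment.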

senko 19 hours ago | parent | prev | next [-]

You can do it in three stages:

1. Make a schema migration that will work both with old and new code

2. Make a code change

3. Clean up schema migration

Example: deleting a field:

1. Schema migration to make the column optional

2. Remove the field in the code

3. Schema migration to remove the column

Yes, it's more complex than creating one schema migration, but that's the price you pay for zero downtime. If you can relax that to "1s of downtime at midnight on Sunday", you can keep things simpler. And if you do so many schema migrations that you need such things often ... I would submit you're holding it wrong :)
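The three stages above can be sketched end to end. SQLite stands in for the real database here (making a column nullable needs a table rebuild in SQLite, where Postgres would use `ALTER COLUMN ... DROP NOT NULL`; `DROP COLUMN` needs SQLite >= 3.35), and the `invoice`/`legacy_ref` names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE invoice (id INTEGER PRIMARY KEY, total REAL, "
    "legacy_ref TEXT NOT NULL)"
)

# Stage 1: schema migration - make the doomed column optional so code
# that no longer knows about it can still INSERT.
con.executescript("""
CREATE TABLE invoice_new (id INTEGER PRIMARY KEY, total REAL, legacy_ref TEXT);
INSERT INTO invoice_new SELECT * FROM invoice;
DROP TABLE invoice;
ALTER TABLE invoice_new RENAME TO invoice;
""")

# Stage 2: deploy code without the field; both versions keep working.
con.execute("INSERT INTO invoice (total) VALUES (10.0)")  # new code
con.execute(
    "INSERT INTO invoice (total, legacy_ref) VALUES (5.0, 'X')"
)  # old code, still running during the rollout

# Stage 3: once no deployed version touches legacy_ref, drop it.
con.execute("ALTER TABLE invoice DROP COLUMN legacy_ref")
count = con.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
print(count)  # -> 2
```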

ndr 18 hours ago | parent | next [-]

I'm doing all of these and none of it works out of the box.

Adding a field needs a `db_default`, otherwise old-code fails on `INSERT`; without one you need to audit all the `create`-like calls.

Deleting a field similarly makes old-code fail all its `SELECT`s.

For deletion I need a special 3-step dance with managed=False for one deploy. And for all of these I need to run old-tests on new-schema to see if there's some usage any member of our team missed.
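The `db_default` point can be shown in isolation. This is a sketch with SQLite standing in for the real database; `article`/`status` are invented names, and the key is that the DEFAULT lives in the database itself (which is what Django 5.0's `db_default` field option emits), not only in Python code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")

# New-schema migration: NOT NULL is only safe for still-running old code
# because the database fills in the DEFAULT itself.
con.execute(
    "ALTER TABLE article ADD COLUMN status TEXT NOT NULL DEFAULT 'draft'"
)

# Old code, which has never heard of `status`, can still INSERT.
con.execute("INSERT INTO article (title) VALUES ('Hello')")
status = con.execute("SELECT status FROM article").fetchone()[0]
print(status)  # -> draft
```

With only a Python-side `default`, the old code's `INSERT` would omit the column and violate the NOT NULL constraint.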

jgavris 19 hours ago | parent | prev [-]

I was just in the middle of writing something similar above, thanks!

aljarry 19 hours ago | parent | prev | next [-]

One option is to do a multi-stage rollout of your database schema and code, over some time window. I recall a recent blog post (here, I think) from some Big Company (tm) that would run one step of the plan below every week:

1. Create new fields in the DB.

2. Make the code fill in the old fields and the new fields.

3. Make the code read from new fields.

4. Stop the code from filling old fields.

5. Remove the old fields.

Personally, I wouldn't use the full version until I really need it. But a simpler form is good: do the required (additive) schema changes one iteration earlier than the code changes, and do the destructive changes one iteration after your code stops using those parts of the schema. Things like "make a non-nullable field nullable" and "make a nullable field non-nullable" are handled in opposite orders, but that's part of the price of smooth operations.
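The five steps above can be compressed into one runnable sketch. SQLite stands in for the real database (`DROP COLUMN` needs SQLite >= 3.35), each comment marks which step a statement belongs to, and the `users`/`name`/`full_name` names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Step 1: additive migration - create the new field.
con.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Step 2: code fills in both the old and the new field.
con.execute(
    "INSERT INTO users (name, full_name) VALUES (?, ?)", ("Ada", "Ada")
)

# Steps 3-4: code reads from the new field and stops filling the old one.
con.execute("INSERT INTO users (full_name) VALUES (?)", ("Grace",))

# Step 5: destructive migration, only once no deployed version reads `name`.
con.execute("ALTER TABLE users DROP COLUMN name")
rows = con.execute("SELECT full_name FROM users ORDER BY id").fetchall()
print(rows)  # -> [('Ada',), ('Grace',)]
```

Run one step per deploy (or per week, as in the plan above); at every point in between, the live schema is compatible with both the previous and the next version of the code.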

Izkata 6 hours ago | parent [-]

2.5 (if relevant) mass-migrate data from the old column to the new column, so you don't have to wait forever.
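That backfill is best done in small keyset-paginated batches so no single UPDATE holds locks for long. A minimal sketch, with SQLite standing in for the real database and the table/column names and batch size invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)
con.executemany(
    "INSERT INTO users (name) VALUES (?)", [("u%d" % i,) for i in range(10)]
)

BATCH = 3
last_id = 0
while True:
    # Find this batch's upper bound, then update only that slice of ids.
    upper = con.execute(
        "SELECT MAX(id) FROM "
        "(SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?)",
        (last_id, BATCH),
    ).fetchone()[0]
    if upper is None:
        break
    con.execute(
        "UPDATE users SET full_name = name "
        "WHERE id > ? AND id <= ? AND full_name IS NULL",
        (last_id, upper),
    )
    con.commit()  # commit per batch keeps each transaction short
    last_id = upper

remaining = con.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
print(remaining)  # -> 0
```

The `full_name IS NULL` guard makes the backfill idempotent and safe to run while the dual-writing code from step 2 is live.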

m000 19 hours ago | parent | prev | next [-]

Deploying on Kubernetes using Helm solves a lot of these cases: migrations are run at the init stage of the pods. If they succeed, pods of the new version are started one by one while pods of the old version are shut down. For a short period, you have pods of both versions running.

When you add new stuff or make benign modifications to the schema (e.g. add an index somewhere), you won't notice a thing.

If the introduced schema changes are not compatible with the old code, you may get a few `ProgrammingError`s raised from the old pods before they are replaced. Which is usually acceptable.

There are still some changes that may require planning for downtime, or some other sort of special handling. E.g. upgrading a SmallIntegerField to an IntegerField in a frequently written table with millions of rows.

ndr 18 hours ago | parent [-]

Without care, new-schema will make old-code fail user requests; that is not zero downtime.

m000 17 hours ago | parent [-]

A request not being served can happen for a multitude of reasons (many of them totally beyond your control) and the web architecture is designed around that premise.

So, if some of your pods fail a fraction of the requests they receive for a few seconds, this is not considered downtime for 99% of the use cases. The service never really stopped serving requests.

The problem is not unique to Django by any means. If you insist on being a purist, sure count it as downtime. But you will have a hard time even measuring it.

jgavris 19 hours ago | parent | prev [-]

The general approach is to do multiple migrations (add first and make new-code work with both, deploy, retire old-code, then delete old-schema), and this is not specific to Django's ORM in any way; the same goes for any database schema deployment. Take a peek at https://medium.com/@pranavdixit20/zero-downtime-migrations-i... for some ideas.