jknoepfler 2 days ago

I use Patroni for that in a k8s environment (although it works anywhere). I get an off-the-shelf declarative deployment of an HA postgres cluster with automatic failover, from a little boilerplate YAML.

Patroni has been around for a while. The database-as-a-service team where I work uses it under the hood, and I used it to build database-as-a-service functionality on the infra platform team I was on before that.

It's basically push-button production PG.

There's at least one decent operator framework leveraging it, if that's your jam. I've been living and dying by self-hosting everything with k8s operators for about 6-7 years now.
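For anyone who hasn't seen it, the boilerplate YAML is roughly this shape. This is a minimal sketch, not from the commenter's setup; hostnames, the etcd endpoints, and credentials are placeholders:

```yaml
# Minimal Patroni config sketch (one node of the cluster).
scope: mycluster          # cluster name, shared by all members
name: node1               # unique per member

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd3:                    # the DCS; could also be consul/zookeeper
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: changeme   # placeholder
    replication:
      username: replicator
      password: changeme   # placeholder
```

Each member runs Patroni with a config like this; the members elect a leader through the DCS and failover happens automatically.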

tempest_ a day ago

We use Patroni and run it outside of k8s on prem; no issues in 6 or 7 years. Just upgraded from pg 12 to 17 with essentially no downtime and no problems.

baobun a day ago

Yo, I'm curious if you have any pointers to share on how you went about this. Did you use their provided upgrade script, or did you instrument the upgrade yourself "out of band"? rsync?

Currently scratching my head over what the appropriate upgrade procedure is for a non-k8s/operator spilo/patroni cluster with minimal downtime and risk. The script doesn't seem to work for this setup; it errors on a mismatched PG_VERSION when I attempt it. If you don't mind sharing, it would be very appreciated.

tempest_ 8 hours ago

I did not use a script (my environment is bare metal running Ubuntu 24).

I read these and then wrote my own scripts that were tailored to my environment.

https://pganalyze.com/blog/5mins-postgres-zero-downtime-upgr...

https://www.pgedge.com/blog/always-online-or-bust-zero-downt...

https://knock.app/blog/zero-downtime-postgres-upgrades

Basically:

- Created a new cluster on new machines

- Started logically replicating

- Waited for that to complete and then left it there replicating for a while until I was comfortable with the setup

- We were already using haproxy and pgbouncer

- Then I did a cut over to the new setup

- Everything looked good so after a while I tore down the old cluster

- This was for a database 600 GB to 1 TB in size

- The client application was not doing anything overly fancy which meant there was very little to change going from 12 to 17

- Additionally I did all of the above in a staging environment first to make sure it would work as expected
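The replication part of those steps boils down to something like this on the two clusters. A hedged sketch: publication/subscription names, the connection string, and the assumption that the schema was copied over first are illustrative, not details from the comment above:

```sql
-- On the old (pg 12) primary: publish every table.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (pg 17) cluster, after copying the schema over
-- (e.g. pg_dump --schema-only piped into psql):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=mydb user=replicator'
    PUBLICATION upgrade_pub;

-- Back on the old primary: watch lag until it stays near zero.
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

One gotcha worth flagging: logical replication does not copy sequence values, so sequences on the new cluster need to be advanced past the old ones before the cutover.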

Best of luck.