diggan 15 hours ago

> but it sounds a little bit nerve wracking.

As long as you're doing backups (you are doing backups, right?), and validate that those backups work (you are validating that those backups work, right?), what's making you nervous about it?

mmcnl 15 hours ago | parent | next [-]

Doing backups and validating backups is very error-prone and time-consuming. To me this reads as: "If you do all the hard, complex work yourself, what's making you nervous about it?"

It's far easier to do backups and database hosting at scale. Database failures are rare, so it's this one-off situation that you have to be prepared for. That requires clearly defined processes, clearly defined roles and responsibilities, and most importantly: feedback from unfortunate incidents that you can learn from. All of that is very hard to accomplish when you're self-hosting.

ownagefool 14 hours ago | parent | next [-]

It's actually probably a more difficult problem at scale.

When you have a single smallish schema, you export, restore, and write automated tests that prove the backups work in about 10 minutes of runtime (development time is a few days or weeks). Either the transaction runs or it errors, and either the test passes or it doesn't.
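Roughly the shape that test can take, as a minimal sketch; the database names and tables below are placeholders, and the pg_dump/pg_restore/psql calls are the real pieces:

```python
# backup_check.py -- rough sketch of an automated dump/restore smoke test.
# All names are placeholders; assumes pg_dump/pg_restore/psql/createdb/dropdb
# are on PATH and can connect via the usual PG* environment variables.
import subprocess

PROD_DB = "app_prod"          # hypothetical production database
SCRATCH_DB = "backup_check"   # throwaway database used only for verification
TABLES = ["users", "orders"]  # placeholder tables whose row counts we spot-check

def run(*cmd):
    subprocess.run(cmd, check=True)

def count(db, table):
    out = subprocess.run(
        ["psql", "-d", db, "-At", "-c", f"SELECT count(*) FROM {table}"],
        check=True, capture_output=True, text=True)
    return int(out.stdout.strip())

# 1. Take a fresh custom-format dump.
run("pg_dump", "-Fc", "-d", PROD_DB, "-f", "/tmp/backup_check.dump")

# 2. Restore it into a clean scratch database.
run("dropdb", "--if-exists", SCRATCH_DB)
run("createdb", SCRATCH_DB)
run("pg_restore", "-d", SCRATCH_DB, "/tmp/backup_check.dump")

# 3. Spot-check a few row counts (on a busy system, compare against counts
#    recorded at dump time rather than live prod).
for table in TABLES:
    assert count(SCRATCH_DB, table) == count(PROD_DB, table), f"{table} mismatch"
print("backup/restore check passed")
```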

The problem when small is obviously knowledge, skills, and procedures.

Things like:

- What if the monitoring that alerts me that the backups are down is also actually down?

- What do you mean it's no longer "safe" to `kubectl delete pvc --all`?

- What do you mean there's nobody around with the skills to unfuck this?

- What do you mean I actually have to respond to these alerts in a timely manner?

The reality is, when the database is small, it typically doesn't cost a whole lot, so there's little incentive to really build the tooling and skills for this when you can get a reasonable managed service.

I typically have those skills, but still use a managed service for my own startup because it's not worth my time.

Once the bill is larger than the TCO of self-hosting, you have another discussion.

diggan 13 hours ago | parent | prev [-]

> Doing backups and validating backups is very error-prone and time consuming

Right, but regardless of whether you use a managed database service or a self-hosted database, this is something you are probably doing anyway, at least for a production service with real users. Sure, the managed service probably helps you with the details of how the backup is made, where it's stored, and how the restore process happens, but you still need to validate your backups and the rest. Replicate that process/experience with your self-hosted setup and you're nearly there.
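The validation step itself looks the same either way: connect to the restored copy and check the data is both there and recent enough. A tiny sketch (the `app_restored` database, the `events` table, and the 24-hour window are made-up examples):

```python
# restore_freshness_check.py -- sketch of a post-restore sanity check that
# works the same against a managed or self-hosted restore. Placeholder names:
# "app_restored" and the "events" table with a created_at column are made up.
import subprocess

RESTORED_DB = "app_restored"  # database restored from the latest backup
MAX_LAG_HOURS = 24            # example recovery-point objective

query = "SELECT extract(epoch FROM now() - max(created_at)) / 3600 FROM events"
out = subprocess.run(
    ["psql", "-d", RESTORED_DB, "-At", "-c", query],
    check=True, capture_output=True, text=True)
lag_hours = float(out.stdout.strip())
assert lag_hours <= MAX_LAG_HOURS, f"restored data is {lag_hours:.1f}h old"
print(f"restore looks fresh: newest row is {lag_hours:.1f}h old")
```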

mewpmewp2 15 hours ago | parent | prev | next [-]

Yeah, that's what is a bit odd to me. I feel like AWS and everything like it is much more of a black box compared to something like Postgres, which is so thoroughly tested, proven to be reliable, etc.

tux3 15 hours ago | parent | prev | next [-]

I think managing stateless infrastructure is much easier: if anything goes haywire, you can expect a readiness probe to fail, k8s quietly takes the instance down, and life continues with no downtime.

It is also perfectly possible to roll your own highly available Postgres setup, but that requires a whole other set of precise configuration, attention to detail, caring about the hardware, occasionally digging into kernel bugs, and so forth that cloud providers happily handle behind the scenes. I'm very comfortable with low-level details, but I have never built my own cloud.

I do test my backups, but having to restore anything from backups means something has gone catastrophically wrong, I have downtime, and I have probably lost data. Everything needed to prevent that scenario is what's making me sweat a little bit.

JimBlackwood 14 hours ago | parent | next [-]

> occasionally digging into kernel bugs

Haha, been there! We recently had outages on kube-proxy due to a missing `--set-xmark` option in iptables-restore on Ubuntu 24.04.

On any stateful server we always try to stay several major versions behind because of issues like the above; that avoids most kernel bugs and related issues.

lossolo 14 hours ago | parent | prev [-]

> occasionally digging into kernel bugs

No, it doesn't. I've been self-hosting a multi-node, highly available, and fault-tolerant PostgreSQL setup for years, and I've never had to go to that level. After reading your whole post, I'm not sure where you're getting your information from.

tux3 14 hours ago | parent [-]

Horror stories stick with me more than success stories, but I'm happy to take the feedback. I'm glad it went well for you, that's a small update for me.

fipar 15 hours ago | parent | prev | next [-]

Backups with periodic restore validation (which is not trivial) are a must, but don’t make database maintenance any less nerve wracking.

Sure, you won’t lose data, but the downtime …

evantbyrne 14 hours ago | parent | prev | next [-]

How do you prefer to collect backups when self-hosting postgres?

zie 13 hours ago | parent | next [-]

We use barman too, but we do hourly and daily restores into different database instances.

So for example, our prod db is tootie_prod. We set up another instance that restores from barman every hour and renames the db to tootie_hourly.

We do the same thing daily.

This means we have backup copies of prod that are great for customer service and for devs troubleshooting problems. You can make all the changes you want to _daily or _hourly and it will all get erased and updated in a bit.

Since _hourly and _daily are used regularly, this also ensures that our backups are working: they're now part of our daily usage, so they never stay broken for long.
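Roughly, the hourly job is a thin wrapper around `barman recover` plus a rename once the scratch instance is up. A hedged sketch (server name, port, and paths are placeholders; assumes it runs as the postgres OS user on the host where the recovery lands):

```python
# hourly_refresh.py -- sketch of the "restore into tootie_hourly" idea.
# Server name, port, and data directory are placeholders; assumes barman and
# the PostgreSQL binaries are on PATH and the scratch instance exists only
# for this job.
import shutil
import subprocess
from pathlib import Path

BARMAN_SERVER = "tootie"                # barman server name (placeholder)
DATADIR = "/var/lib/postgresql/hourly"  # scratch instance data directory
PORT = "5544"                           # scratch instance port, separate from prod

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Stop the scratch instance (fine if it isn't running) and start from an
#    empty data directory with the permissions Postgres expects.
subprocess.run(["pg_ctl", "-D", DATADIR, "stop", "-m", "fast"])
shutil.rmtree(DATADIR, ignore_errors=True)
Path(DATADIR).mkdir(parents=True)
Path(DATADIR).chmod(0o700)

# 2. Recover the latest barman backup into it.
run("barman", "recover", BARMAN_SERVER, "latest", DATADIR)

# 3. Start the scratch instance on its own port.
run("pg_ctl", "-D", DATADIR, "-o", f"-p {PORT}", "start")

# 4. Rename the restored prod database so tooling can target tootie_hourly.
run("psql", "-p", PORT, "-d", "postgres", "-c",
    "ALTER DATABASE tootie_prod RENAME TO tootie_hourly")
```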

JimBlackwood 12 hours ago | parent [-]

Hey, this is a pretty neat idea! I might just use this :)

zie 2 hours ago | parent [-]

Please do!

JimBlackwood 14 hours ago | parent | prev [-]

Not OP, but:

Barman on the host, with a cronjob for physical backups, and as the archive/restore command for WAL archiving and point-in-time recovery.

Another cronjob for logical backups.

They all ship to some external location (S3/SFTP) for storage.

I like the above since it adds minimal complexity, uses mainly native postgres commands and gives pretty good reliability (in our setup, we’d lose the last few minutes of data in the absolute worst case).
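The logical-backup cronjob can be as small as a pg_dump plus an upload. A rough sketch using boto3 (database name, bucket, and paths are placeholders):

```python
# logical_backup.py -- sketch of a nightly logical backup shipped to S3.
# Placeholder names throughout; assumes pg_dump is on PATH and boto3 can find
# AWS credentials (env vars, instance role, etc.).
import datetime
import subprocess

import boto3

DB = "app_prod"                # hypothetical database to dump
BUCKET = "example-pg-backups"  # hypothetical S3 bucket

stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dump_path = f"/tmp/{DB}-{stamp}.dump"

# Custom-format dump: compressed, and restorable table-by-table with pg_restore.
subprocess.run(["pg_dump", "-Fc", "-d", DB, "-f", dump_path], check=True)

# Ship it off-host; bucket lifecycle rules can handle retention.
boto3.client("s3").upload_file(dump_path, BUCKET, f"logical/{DB}/{stamp}.dump")
print(f"uploaded {dump_path} to s3://{BUCKET}/logical/{DB}/{stamp}.dump")
```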

anal_reactor 14 hours ago | parent | prev [-]

In Spain, when you want to travel by high-speed train, you need to go through a security check, just like at an airport. Do the security checks make sense? No. But nobody wants to be the politician who removes the security checks and then has something bad happen. So the security checks stay.