tux3 a month ago

Self-hosting postgres is appealing from a flexibility standpoint (don't be locked out of superuser or advanced features), but it sounds a little bit nerve-wracking.

I'm now hoping all the cloud providers read this article and start exposing a way to disable an index in the query planner before you drop it for real; that should really become standard procedure. Just in case.

But if you're a large-scale company, to the point of wanting to own and customize your stack, it can definitely make sense to self-host.

bohanoai a month ago | parent | next [-]

Speaker here — Bohan from OpenAI.

Just to clarify, we're using Azure Database for PostgreSQL, not a self-hosted setup. I mentioned "Azure Postgres" multiple times during the talk, but I should have been more specific that we're referring to Microsoft’s managed PostgreSQL service. Apologies for the confusion.

tux3 a month ago | parent [-]

Sorry, I see this has caused some confusion; I should have clarified in my comment. I was responding to the section by Lao Feng at the end, where their conclusion tries to make the case for Pigsty and self-hosted pg.

oa335 a month ago | parent | prev | next [-]

There are already several ways to force postgres to use or not use a particular index. We use cloud-hosted and managed Postgres instances on one of the major providers, and I've used a couple of these strategies to massage the query planner.

1. Fiddle with the query planner settings (this can be done at a per-query level as well, so it's not global). E.g. enable_indexscan=off, enable_indexonlyscan=off

2. Add a trivial calculation to the filter clause of the query. E.g. select * from table where indexed_col + 0 = 12345 shouldn't use the index, as the planner won't simplify the arithmetic to match it.

3. Use the pg_hint_plan extension, which allows you to add comments to your queries to urge the planner to use certain data access strategies, including specifying which indices to use. See: https://pg-hint-plan.readthedocs.io/en/latest/hint_table.htm...
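
To make those three concrete, here's a minimal sketch (the table, column, and index names are made up, and the last one assumes the pg_hint_plan extension is installed):

```sql
-- 1. Per-session planner toggles (session-scoped, not global):
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SELECT * FROM orders WHERE customer_id = 12345;
RESET enable_indexscan;
RESET enable_indexonlyscan;

-- 2. A no-op expression the planner won't fold, so it can't match the index:
SELECT * FROM orders WHERE customer_id + 0 = 12345;

-- 3. pg_hint_plan: steer the planner with a comment hint, e.g. force a seq scan:
/*+ SeqScan(orders) */
SELECT * FROM orders WHERE customer_id = 12345;
```

Comparing EXPLAIN ANALYZE output with and without each trick is the quickest way to see whether the index actually matters for a given query.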

bostik a month ago | parent [-]

That second trick is neat, I'll need to remember that. I also really wish I had known about it back in 2015, when we had to massage an inconveniently big (at the time) PG database and started to carve it out into smaller, purpose-specific instances.

Being able to verify that an index was either useless or inefficient without jumping through hoops would have saved quite a lot of time.

affandar a month ago | parent | prev | next [-]

(Disclosure: I'm from Azure's Postgres team)

Clarification: OpenAI does not self-host Postgres. They use Azure's managed PostgreSQL offering (aka Azure Database for PostgreSQL Flexible Server).

evanelias a month ago | parent | prev | next [-]

> I'm now hoping all the cloud providers read this article and start exposing the feature to try disabling an index in the query planner before you drop it for real

I'm surprised there isn't non-superuser DDL to handle this. For example in MySQL you can ALTER an index to make it INVISIBLE (or equivalently in MariaDB, IGNORED) and the planner won't use it.
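
For the record, the syntax is a one-liner (table and index names here are hypothetical):

```sql
-- MySQL 8.0: hide the index from the planner while it's still maintained
ALTER TABLE orders ALTER INDEX idx_customer INVISIBLE;
-- ...check query plans, then either drop it for real or bring it back:
ALTER TABLE orders ALTER INDEX idx_customer VISIBLE;

-- MariaDB equivalent:
ALTER TABLE orders ALTER INDEX idx_customer IGNORED;
```

Since the index is still maintained while invisible, reverting is instant — unlike re-creating a dropped index on a big table.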

lossolo a month ago | parent | prev | next [-]

> but it sounds a little bit nerve-wracking.

> But if you're a large scale company ... it can definitely make sense to self-host.

I'm not a large company like OpenAI and I've been running various PostgreSQL setups for years—ranging from single-node instances without replication to multi-node, fault-tolerant, highly available configurations with automatic failover and backups, serving 4-5 digits of updates, selects, and inserts per second. I'm not sure what you're referring to, but in my experience, once it's set up, it's almost maintenance-free.

hobs a month ago | parent [-]

No large-scale heterogeneous database cluster is maintenance-free: plans change because data changes, query patterns change, resource utilization of certain features grows or shrinks, new application features are launched, new indexes need to be added, things are constantly migrated and might need a special case because the default migration strategy doesn't perfectly handle giant bulk changes, etc.

lossolo a month ago | parent [-]

You need to add indexes on cloud-hosted instances too, perform migrations there as well, and launch new features in your application in both cases. Almost everything you're talking about has little to do with whether you're self-hosting or using a cloud provider. You still have to deal with most of it either way.

EDIT: Don't get me wrong, I've also managed Kafka clusters, ClickHouse clusters, Elasticsearch clusters, etc. and I have my share of Zookeeper horror stories. Some of the tools I just mentioned are definitely not maintenance-free. But in my experience, you can't really compare PostgreSQL to them.

cryptonector a month ago | parent | prev | next [-]

> Self-hosting ~~postgres~~ is appealing from a flexibility standpoint [...]

whalesalad a month ago | parent | prev | next [-]

self-hosting psql is trivial - what is the scary part? that's how we used it for decades until things like RDS came around.

sgarland a month ago | parent | next [-]

I would never say that self-hosting anything is trivial. Linux administration, tuning, and troubleshooting can be learned, obviously, and the same is true of RDBMS. Neither is a trivial skillset if you want to actually be able to run stuff at scale, though. There's a massive difference between 25 QPS and 25,000 QPS (TFA states ~40 replicas, and an aggregate of 1,000,000 QPS).

That shouldn't deter anyone from trying, though. You can't learn if you don't try.

olalonde a month ago | parent | prev | next [-]

Back in the 2000s, Database Administrator (DBA) was a pretty popular job title, and they usually got paid way more than regular software developers. It probably wouldn’t have been like that if managing databases was "trivial".

lossolo a month ago | parent | prev [-]

Yeah, that's baffling to me too. I started self-hosting my databases with MySQL 21 years ago, before the cloud even existed.

edoceo a month ago | parent [-]

Before the cloud? More than 21 years ago that damn cloud icon was in every Visio network diagram.

dragonwriter a month ago | parent | next [-]

The popularity of the cloud icon in diagrams well precedes the term “cloud computing” being coined for on-demand scalable (usually, but not always, remote-hosted) infrastructure and services that accompanied the explosion of such services in about the mid-00s.

edoceo a month ago | parent [-]

IME folk were saying move it to the cloud, with the meaning of a managed provider, before 2005.

lossolo a month ago | parent | prev | next [-]

> Before the cloud?

Yes, the cloud in the modern sense (as in, on-demand scalable infrastructure like AWS) was just beginning to emerge back then; AWS launched S3 and EC2 in 2006, so 19 years ago. Other cloud services followed over the next several years.

dboreham a month ago | parent | prev [-]

Before that it was on whiteboards.

gjvc a month ago | parent | prev | next [-]

this sort of thing https://www.oracle.com/uk/autonomous-database/what-is-autono... ?

"An autonomous database is a cloud database that uses machine learning to automate database tuning, security, backups, updates, and other routine management tasks traditionally performed by DBAs. Unlike a conventional database, an autonomous database performs all these tasks and more without human intervention."

tux3 a month ago | parent [-]

That sounds like chaos engineering with extra steps! Have the AI randomly tweak your DB, and see if your infra holds up? :)

mike_hearn a month ago | parent [-]

It's not AI based. "Autonomous" is just the term Oracle uses to communicate that the database has a ton of self-management features, and that it was a big area of focus over time. If you use the recent versions it's a much less admin-heavy database than is conventional for an RDBMS. For example it can notice what queries are slow and add indexes for you, automatically. That's ordinary statistics and planning, GOFAI if you like.

Disclosure: I have a part time job with Oracle in the research division, so I very much have conflicts of interest, but OpenAI should listen to me anyway (and this post is my own opinion, obviously). Postgres is great and I use it in personal projects, but the amount of effort our industry collectively puts into working around its limitations and problems is a bit concerning. According to the blog post, their "solutions" to the problems mostly involve just not using the database, which is no solution at all. They could just subscribe to a managed Oracle DB in Azure and every problem they're having would go away. The cost is surprisingly not much different to a managed Postgres. Specifically, in the current version:

- Database nodes scale horizontally and elastically. There is not such a thing as a primary master and read replicas in an Oracle cluster, although you can have read-through cache nodes (they are not replicas). And because of the unusual way Oracle clusters are integrated with the hardware there also isn't such a thing as replication lag within the cluster: writes are fully synchronous even when scaling horizontally and reads are always up to date.

- Database clusters are HA. Node lists are automatically distributed to client drivers. If a node fails or you do a rolling upgrade on the cluster, it's transparent to the client beyond a minor elevation in latency as they retry. You can not only do rolling upgrades but also roll them across major versions.

- The MVCC engine in Postgres is strange and unique. Other databases don't have concepts like vacuuming or the attendant table/index bloat. This isn't unique to Oracle, but nonetheless switching to something else means a massive productivity drain gone, just like that.

- You can do schema changes and do them online. They say they don't allow users to create new tables, which, I'm sorry, is ridiculous. It's not their fault, but I'd consider a database to which you can't add tables basically broken. You can even do arbitrary schema changes online in Oracle, because the db can create a new table and copy the data across to the new schema whilst doing incremental sync, then do an in-place rename at the end. That works under heavy load.

- You can disable indexes, marking them as invisible to the planner whilst still maintaining them.

- You can multiplex transactions onto sessions and do db-side connection pooling, without things like "bouncers", so idle connection management is less of an issue.

- You can derive 95/99th percentile latencies from v$active_session_history because the wait times for every SQL statement are available.

- You can audit schema changes along with everything else.
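
The invisible-index point, for example, is plain DDL (index name hypothetical):

```sql
-- Hide the index from the optimizer; it is still maintained on writes:
ALTER INDEX idx_orders_customer INVISIBLE;

-- A session can still opt in, to test whether the index is worth keeping:
ALTER SESSION SET OPTIMIZER_USE_INVISIBLE_INDEXES = TRUE;

-- Restore it once satisfied, or drop it for real:
ALTER INDEX idx_orders_customer VISIBLE;
```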

On pricing, at these scales it just doesn't matter. An ExaData cluster might actually be cheaper. I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (an under-estimate) and the HA option. Even with a 1-year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache. Even if it were a lot more expensive, OpenAI is opportunity-cost constrained right now, not capital constrained. All the time their devs spend finding complicated ways to not put things in their melting database is time lost to competitors.

joshsm5 a month ago | parent [-]

Exadata would definitely be an option and it has its pros (I use it at work, but that decision wasn't mine). Some of the features you forgot to mention are:

- The built-in bloom filters for reducing the number of blocks scanned for full table scans

- Oracle (not specific to Exadata) supports creating reverse key indexes to help address high write loads causing contention on indexes; in Postgres you can use a reverse function to reverse a string, but not the actual byte representation, so it doesn't work on int/bigint types

Oracle doesn't come without its drawbacks:

- Oracle is infamous for their licensing costs and audits. TDE, for example, I believe is a separate licensing cost, despite the fact that for most companies encryption at rest is required

- You have to include the cost of Data Guard in your TCO calculations, because there isn't replication built in like Postgres provides for free to handle DR use cases

- You have to include the cost of Oracle's point-in-time archival backups for restoration in case of failure in your TCO; the Postgres community has come up with a few solutions for this that are free. At this point in time, I wouldn't consider only daily backups acceptable.

- Applications typically work with grapheme clusters; a user doesn't need nor care how many bytes the emojis they filled an input field with take, you need to store them, and user interfaces are generally going to enforce 1 emoji as 1 character. In existing database clusters, a lot of places have not turned on extended strings, so Oracle struggles to handle Unicode due to its 4000-byte limitation on VARCHAR2 until you turn this on.

- Oracle completely ignores the SQL standard's isolation levels, whereas Postgres adheres to them pretty well (ignoring repeatable read interpretations, which happen with other RDBMSes as well)

> it can notice what queries are slow and add indexes for you, automatically

This potentially can cause performance differences in DEV/STG/PROD where an index gets created in STG but then isn't there in PROD. Because of this possibility, it would violate the change management processes where I work but likely wouldn't be an issue for OpenAI.

> Database nodes scale horizontally and elastically. There is not such a thing as a primary master and read replicas in an Oracle cluster

This comes at the cost of possible cluster waits; I've seen them last for 10-minute periods, and they are difficult to track down or optimize for, in my experience.

> Database clusters are HA. Node lists are automatically distributed to client drivers. If a node fails or you do a rolling upgrade on the cluster

Everywhere I have worked does not use the full Oracle driver, which I believe is required for this. Using the full client requires organizational coordination that sometimes doesn't exist or won't exist, because not every team wants to use Oracle and DevOps doesn't want to support the multitude of different database drivers. I primarily work in Java, and I've only ever seen the JDBC thin client used, which as far as I know doesn't support this: you can create connection strings with a list of servers for production, with failover to a list of servers in another region designated as your DR database, but I don't believe they're dynamic.

> You can multiplex transactions onto sessions and do db-side connection pooling

This is HIGHLY dependent on what programming language/framework you're using. Spring Boot, for example, uses Hikari and generally avoids the driver's internal pool, which likely prevents this. I also believe using the reactive drivers is required to do this in Java in general. Using this directly means you have a hard dependency on the specific database. While it's rare to change database software, it can happen; my work is undergoing a switch from MSSQL to Oracle.

I could be wrong on any of these, considering I have to use Oracle daily at work I'm more than happy to be wrong and learn something since I can apply it immediately.

mike_hearn a month ago | parent [-]

Thanks for the insight! Some of these things were new to me :)

I didn't mention audits and the like because OpenAI are in the cloud, and there are obviously no audits for managed databases. That's relevant for self-hosted setups.

Both options I specced out included a replica, iirc. At least that's what Microsoft's HA option appears to be (it doubles the price, indeed).

One of the problems in comparing with Postgres is people tend to advertise mutually exclusive options, sometimes in different comments by different people so it's not their fault. Anything that requires an extension might as well not exist for most Postgres users because they want to outsource DB management to the cloud, and there the cloud provider chooses the extensions not the user. Oracle doesn't have this problem because all the features are built-in and managed databases have access to everything.

Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

Re: cluster waits. Multi-node clusters will inevitably have failure modes a single machine doesn't unfortunately, I'd prefer to always scale up first before scaling out. One of the advantages of a managed database is that this becomes the cloud provider's problem.

Re: node list distribution. I think that's what the SCAN feature does, which is supported in the thin driver: https://docs.oracle.com/en/database/oracle/oracle-database/2...

Yes, the advanced features are typically not exposed by generic middleware and you will have to change things to take advantage of them, but this is a general issue that affects all kinds of software. Features specific to the Linux kernel are often not exposed in Python or Java either, for instance. Some features are easily enabled by just dropping in UCP, which should be an easy upgrade from Hikari. Very new stuff like tx multiplexing requires more work by the app developer to exploit currently, but hopefully with time frameworks will catch up. Point is, the support is there if you need it. Compared to the gymnastics OpenAI are going through, it's easy stuff.

anarazel a month ago | parent | next [-]

> Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

Postgres' SSI [1] does not block reads. If you have lots of transactions reading and updating a lot of rows the granularity of tracking will become coarser to keep memory usage in bound though.

[1] https://drkp.net/papers/ssi-vldb12.pdf

mike_hearn a month ago | parent [-]

You're right, I mis-remembered - Postgres calls them "SIREAD locks" but they aren't actually locks. The model is based on aborting transactions.

The issue with this - and I'm not saying it's a bad feature because it's not and I'd like to have support for every isolation level everywhere - is that most apps can't tolerate transaction aborts. They need to be written to expect it and loop. Looping can easily turn into livelock if you aren't careful and there aren't great tools or known best practices for handling such a system.

You also have to be careful because such transactions are only strictly serializable vs other transactions run at the same isolation level. So if you accidentally allow in a mix of transactions you are still exposed to isolation anomalies.

So it's a useful feature but not something most apps can easily port to.

jose_zap a month ago | parent | prev | next [-]

> Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

It will not. We use it at work for all transactions and it is very performant.

joshsm5 a month ago | parent | prev [-]

Sorry, holiday, but I was referring to having a separate region for failover, which is called "Geo redundant backup" in Azure Database for Postgres. Whereas with Exadata you need an entirely separate Exadata rack in a different region with the licensing of DataGuard on top of it.

There are a lot of extensions that most cloud providers support; auto range partitioning isn't built into Postgres like it is in Oracle, but it's supported with an extension, for example. Azure Database supports more than AWS, last I checked. https://learn.microsoft.com/en-us/azure/postgresql/extension...

I will have to check with our DBAs if we use the scan feature or not, appreciate you pointing it out.

diggan a month ago | parent | prev [-]

> but it sounds a little bit nerve-wracking.

As long as you're doing backups (you are doing backups, right?), and validate that those backups work (you are validating that those backups work, right?), what's making you nervous about it?
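
For what it's worth, "validating" can be as mundane as a scheduled restore into a scratch instance followed by a sanity query. A sketch, assuming a reachable scratch Postgres server; the host, database, and table names are all made up:

```
#!/bin/sh
set -e

# Take a custom-format logical backup of the production database:
pg_dump -Fc -d prod_db -f /backups/prod_db.dump

# Restore into a throwaway database on a scratch server:
createdb -h scratch-host restore_test
pg_restore -h scratch-host -d restore_test /backups/prod_db.dump

# Sanity-check that the restored data is actually there:
psql -h scratch-host -d restore_test -c "SELECT count(*) FROM orders;"

dropdb -h scratch-host restore_test
```

If the restore or the count fails, the backup was never a backup.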

mmcnl a month ago | parent | next [-]

Doing backups and validating backups is very error-prone and time-consuming. To me this reads as: "If you do all the hard complex work yourself, what's making you nervous about it?"

It's far easier to do backups and database hosting at scale. Database failures are rare, so it's this one-off situation that you have to be prepared for. That requires clearly defined processes, clearly defined roles and responsibilities, and most importantly: feedback from unfortunate incidents that you can learn from. All that is very hard to accomplish when you do self-hosting.

ownagefool a month ago | parent | next [-]

It's actually probably a more difficult problem at scale.

When you have a single smallish schema, you export, restore, and write automated tests that'll probably prove the backups work in about 10 minutes of runtime (development time: a few days/weeks). Either the transaction runs or errors, and either the test passes or not.

The problem when small is obviously knowledge, skills, and procedures.

Things like:

- What if the monitoring that alerts me that the backups are down is also actually down?
- What do you mean it's no longer "safe" to kubectl delete pvc --all?
- What do you mean there's nobody around with the skills to unfuck this?
- What do you mean I actually have to respond to these alerts in a timely manner?

The reality is, when the database is small, it typically doesn't cost a whole lot, so there's a lack of incentive to really tool and skill for this when you can get a reasonable managed service.

I typically have those skills, but still use a managed service for my own startup because it's not worth my time.

Once the bill is larger than the TCO of self-hosting, you have another discussion.

diggan a month ago | parent | prev [-]

> Doing backups and validating backups is very error-prone and time consuming

Right, but regardless of using a managed database service or self-hosted database, this is something you probably are doing anyways, at least for a production service with real users. Sure, the managed service probably helps with you with the details of how the backup is made, where it's stored, and how the restore process happens, but you still need to validate your backups and the rest, so replicate that process/experience with your self-hosted setup and you're nearly there.

mewpmewp2 a month ago | parent | prev | next [-]

Yeah, that's what is a bit odd to me. I feel like AWS and everything like that is much more of a black box compared to some things like Postgres that is so fully tested, proven to be reliable, etc.

tux3 a month ago | parent | prev | next [-]

I think managing stateless infrastructure is much easier, if anything goes haywire you can expect a readiness probe to fail, k8s quietly takes down the instance, and life continues with no downtime.

It is also perfectly possible to roll your own highly-available Postgres setup, but that requires a whole other set of precise configuration, attention to detail, caring about the hardware, occasionally digging into kernel bugs, and so forth that cloud providers happily handle behind the scenes. I'm very comfortable with low-level details, but I have never built my own cloud.

I do test my backups, but having to restore anything from backups means something has gone catastrophically wrong, I have downtime, and I probably have lost data. Everything to prevent that scenario is what's making me sweat a little bit.

lossolo a month ago | parent | next [-]

> occasionally digging into kernel bugs

No, it doesn't require that. I've been self-hosting a multi-node, highly available, and fault-tolerant PostgreSQL setup for years, and I've never had to go to that level. After reading your whole post, I'm not sure where you're getting your information from.

tux3 a month ago | parent [-]

Horror stories stick with me more than success stories, but I'm happy to take the feedback. I'm glad it went well for you, that's a small update for me.

JimBlackwood a month ago | parent | prev [-]

> occasionally digging into kernel bugs

Haha, been there! We recently had outages on kube-proxy due to a missing `--set-xmark` option in iptables-restore on Ubuntu 24.04.

On any stateful server we always try to be several major versions behind due to issues like the above - that really avoids most kernel bugs and related issues.

anal_reactor a month ago | parent | prev | next [-]

In Spain when you want to travel by high-speed train, you need to go through a security check, just like at an airport. Do the security checks make sense? No. But nobody wants to be the politician that removes the security checks, and then something bad happens. So the security checks stay.

fipar a month ago | parent | prev | next [-]

Backups with periodic restore validation (which is not trivial) are a must, but don’t make database maintenance any less nerve-wracking.

Sure, you won’t lose data, but the downtime …

evantbyrne a month ago | parent | prev [-]

How do you prefer to collect backups when self-hosting postgres?

zie a month ago | parent | next [-]

We use barman too, but we do hourly and daily restores into different database instances.

So for example our prod db is tootie_prod. We set up another instance that restores from barman every hour and renames the db to tootie_hourly.

We do the same thing daily.

This means we have backup copies of prod that are great for customer service and for devs troubleshooting problems. You can make all the changes you want to _daily or _hourly, and it will all get erased and updated in a bit.

Since _hourly and _daily are used regularly, this also ensures that our backups are working: they are now part of our daily usage, so they never stay broken for long.

JimBlackwood a month ago | parent [-]

Hey, this is a pretty neat idea! I might just use this :)

zie a month ago | parent [-]

Please do!

JimBlackwood a month ago | parent | prev [-]

Not OP, but:

Barman on the host, with a cronjob for physical backups and as the archive/restore command for WAL archiving and point-in-time recovery.

Another cronjob for logical backups.

They all ship to some external location (S3/SFTP) for storage.

I like the above since it adds minimal complexity, uses mainly native postgres commands and gives pretty good reliability (in our setup, we’d lose the last few minutes of data in the absolute worst case).
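
The cron side of a setup like that might look something like this (the server name, database, paths, and bucket are all made up):

```
# crontab on the backup host (times and names are illustrative)

# Nightly physical base backup via barman:
0 2 * * * barman backup prod-pg

# Nightly logical backup in custom format, shipped straight to S3
# (% must be escaped in crontab entries):
30 3 * * * pg_dump -Fc -d app_db | aws s3 cp - s3://example-backups/app_db-$(date +\%F).dump
```

WAL archiving between the base backups is what narrows the worst-case data loss down to those last few minutes.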