tux3 a month ago

That sounds like chaos engineering with extra steps! Have the AI randomly tweak your DB, and see if your infra holds up? :)

mike_hearn a month ago | parent

It's not AI based. "Autonomous" is just the term Oracle uses to communicate that the database has a ton of self-management features, and that it was a big area of focus over time. If you use the recent versions it's a much less admin-heavy database than is conventional for an RDBMS. For example it can notice what queries are slow and add indexes for you, automatically. That's ordinary statistics and planning, GOFAI if you like.
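
To be concrete: on recent versions this is the DBMS_AUTO_INDEX package (19c+). A hedged sketch of turning it on via JDBC - the connection details are placeholders, and the exact parameter values vary by version, so check the docs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class EnableAutoIndexing {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL/credentials.
            try (Connection c = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//db.example.com:1521/app_svc", "admin", "secret");
                 Statement s = c.createStatement()) {
                // 'IMPLEMENT' creates the indexes it decides on; 'REPORT ONLY' just suggests.
                s.execute("BEGIN DBMS_AUTO_INDEX.CONFIGURE('AUTO_INDEX_MODE', 'IMPLEMENT'); END;");
            }
        }
    }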

Disclosure: I have a part-time job with Oracle in the research division, so I very much have conflicts of interest, but OpenAI should listen to me anyway (and this post is my own opinion, obviously). Postgres is great and I use it in personal projects, but the amount of effort our industry collectively puts into working around its limitations and problems is a bit concerning. According to the blog post, their "solutions" to the problems mostly involve just not using the database, which is no solution at all. They could just subscribe to a managed Oracle DB in Azure and every problem they're having would go away. The cost is, surprisingly, not much different from a managed Postgres. Specifically, in the current version:

- Database nodes scale horizontally and elastically. There is no such thing as a primary master and read replicas in an Oracle cluster, although you can have read-through cache nodes (they are not replicas). And because of the unusual way Oracle clusters are integrated with the hardware, there is also no such thing as replication lag within the cluster: writes are fully synchronous even when scaling horizontally, and reads are always up to date.

- Database clusters are HA. Node lists are automatically distributed to client drivers. If a node fails or you do a rolling upgrade on the cluster, it's transparent to the client beyond a minor elevation in latency as they retry. You can not only do rolling upgrades but also roll them across major versions.

- The MVCC engine in Postgres is strange and unique. Other databases don't have concepts like vacuuming or the attendant table/index bloat. This isn't unique to Oracle, but nonetheless switching to something else means a massive productivity drain is gone, just like that.

- You can do schema changes and do them online. They say they don't allow users to create new tables, which, I'm sorry, is ridiculous. It's not their fault, but I'd consider a database to which you can't add tables to be basically broken. You can even do arbitrary schema changes online in Oracle, because the db can create a new table and copy the data across to the new schema whilst doing incremental sync, then do an in-place rename at the end (see the sketch after this list). That works under heavy load.

- You can disable indexes, marking them as invisible to the planner whilst still maintaining them (example after this list).

- You can multiplex transactions onto sessions and do db-side connection pooling, without things like "bouncers", so idle connection management is less of an issue.

- You can derive 95th/99th percentile latencies from v$active_session_history, because the wait times for every SQL statement are available (also sketched after this list).

- You can audit schema changes along with everything else.
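
The online schema change I mentioned is the DBMS_REDEFINITION package. A hedged sketch of the flow via JDBC - the user/table names are made up, the interim table with the new shape must already exist, and I'm assuming the table has a primary key:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class OnlineRedefinition {
        // 'APP', 'ORDERS' and 'ORDERS_INTERIM' are made-up names.
        static void redefine(Connection c) throws SQLException {
            try (Statement s = c.createStatement()) {
                s.execute("BEGIN DBMS_REDEFINITION.START_REDEF_TABLE("
                        + "uname => 'APP', orig_table => 'ORDERS', int_table => 'ORDERS_INTERIM'); END;");
                // Writes keep flowing to ORDERS; sync the interim copy as often as you like.
                s.execute("BEGIN DBMS_REDEFINITION.SYNC_INTERIM_TABLE("
                        + "uname => 'APP', orig_table => 'ORDERS', int_table => 'ORDERS_INTERIM'); END;");
                // The final swap is a brief in-place rename, done under load.
                s.execute("BEGIN DBMS_REDEFINITION.FINISH_REDEF_TABLE("
                        + "uname => 'APP', orig_table => 'ORDERS', int_table => 'ORDERS_INTERIM'); END;");
            }
        }
    }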

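And the invisible-index and percentile bullets are small enough to sketch together. Hedged: ASH is sampled data and TIME_WAITED is in microseconds and only populated for completed waits, so treat the percentiles as rough; the index name is made up.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class PlannerTricks {
        static void demo(Connection c) throws SQLException {
            try (Statement s = c.createStatement()) {
                // Still maintained on writes, just ignored by the planner.
                s.execute("ALTER INDEX app.orders_status_idx INVISIBLE");

                // Rough per-statement p95/p99 from ASH samples over the last hour.
                try (ResultSet rs = s.executeQuery(
                        "SELECT sql_id,"
                        + " PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY time_waited) p95,"
                        + " PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY time_waited) p99"
                        + " FROM v$active_session_history"
                        + " WHERE sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR"
                        + " GROUP BY sql_id")) {
                    while (rs.next()) {
                        System.out.printf("%s p95=%dus p99=%dus%n",
                                rs.getString(1), rs.getLong(2), rs.getLong(3));
                    }
                }
            }
        }
    }
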
On pricing, at these scales it just doesn't matter. An Exadata cluster might actually be cheaper. I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (an under-estimate) and the HA option. Even with a 1-year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/Exadata cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache. Even if it were a lot more expensive, OpenAI is opportunity-cost constrained right now, not capital constrained. All the time their devs spend finding complicated ways to not put things in their melting database is time lost to competitors.

joshsm5 a month ago | parent

Exadata would definitely be an option and it has its pros (I use it at work, but that decision wasn't mine). Some of the features you forgot to mention:

- The built-in bloom filters for reducing the number of blocks scanned for full table scans (FTS)

- Oracle (not specific to Exadata) supports creating reverse key indexes to help address high write loads causing contention on indexes. In Postgres you can use a reverse function to reverse a string, but not the actual byte representation, so it doesn't work on int/bigint types.

Oracle doesn't come without its drawbacks:

- Oracle is infamous for their licensing costs and audits. TDE, for example, I believe is a separate licensing cost, despite the fact that for most companies encryption at rest is required.

- You have to include the cost of Data Guard in your TCO calculations, because there isn't replication built in like Postgres provides for free to handle DR use cases.

- You have to include the cost of Oracle's point-in-time archive backups for restoration in case of failure in your TCO; the Postgres community has come up with a few solutions for this that are free. At this point in time, I wouldn't consider daily backups alone acceptable.

- Applications typically work with grapheme clusters: a user doesn't know or care how many bytes the emoji they filled an input field with take, you need to store them, and user interfaces are generally going to enforce 1 emoji as 1 character. In existing database clusters, a lot of places have not turned on extended strings, so Oracle struggles to handle Unicode due to its 4000-byte limit on VARCHAR2 until you turn it on (see the sketch after this list).

- Oracle completely ignores the SQL standard's isolation levels, whereas Postgres adheres to them pretty well (ignoring repeatable read interpretations, which happen with other RDBMSes as well)
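
To make the bytes-vs-graphemes point concrete, a minimal sketch in Java (hedged: java.text.BreakIterator only approximates extended grapheme clusters, and older JDKs split ZWJ emoji sequences):

    import java.nio.charset.StandardCharsets;
    import java.text.BreakIterator;

    public class GraphemeBytes {
        public static void main(String[] args) {
            String s = "👨‍👩‍👧‍👦"; // family emoji: one grapheme cluster, a ZWJ sequence
            // 25 UTF-8 bytes: four 4-byte code points joined by three 3-byte ZWJs.
            System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) graphemes++;
            // Ideally 1; older JDKs split the ZWJ sequence and report more.
            System.out.println("Grapheme clusters: " + graphemes);
        }
    }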

> it can notice what queries are slow and add indexes for you, automatically

This potentially can cause performance differences in DEV/STG/PROD where an index gets created in STG but then isn't there in PROD. Because of this possibility, it would violate the change management processes where I work but likely wouldn't be an issue for OpenAI.

> Database nodes scale horizontally and elastically. There is no such thing as a primary master and read replicas in an Oracle cluster

This comes at the cost of the possibility of cluster waits; I've seen them last for 10-minute periods, and they are difficult to track down or optimize for, in my experience.

> Database clusters are HA. Node lists are automatically distributed to client drivers. If a node fails or you do a rolling upgrade on the cluster

Everywhere I have worked does not use the full Oracle driver, which is what I believe is required for this. Using the full client requires organizational coordination that sometimes doesn't exist or won't exist, because not every team wants to use Oracle and DevOps doesn't want to support the multitude of ways to use different database drivers. I primarily work in Java, and I've only ever seen the JDBC thin client used, which as far as I know doesn't support this. You can create connection strings with a list of servers for production, with failover to a list of servers in another region designated as your DR database (something like the sketch below), but I don't believe they're dynamic.
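
For illustration, a static thin-driver URL of that shape might look like this (a hedged sketch; the hosts and service name are made up):

    public class OracleFailoverUrl {
        // FAILOVER=ON tries the DR address list when the production list is
        // unreachable; nothing here is dynamic.
        static final String URL =
            "jdbc:oracle:thin:@(DESCRIPTION=(FAILOVER=ON)"
            + "(ADDRESS_LIST=(LOAD_BALANCE=ON)"
            + "(ADDRESS=(PROTOCOL=TCP)(HOST=prod-db1.example.com)(PORT=1521))"
            + "(ADDRESS=(PROTOCOL=TCP)(HOST=prod-db2.example.com)(PORT=1521)))"
            + "(ADDRESS_LIST="
            + "(ADDRESS=(PROTOCOL=TCP)(HOST=dr-db1.example.com)(PORT=1521)))"
            + "(CONNECT_DATA=(SERVICE_NAME=app_svc)))";
    }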

> You can multiplex transactions onto sessions and do db-side connection pooling

This is HIGHLY dependent on what programming language/framework you're using. Spring Boot, for example, uses Hikari and generally avoids using the internal driver pool, which likely prevents this. I also believe using the reactive drivers is required to do this in Java in general. Using this directly means you have a hard dependency on the specific database. While it's rare to change database software, it can happen; my workplace is undergoing a switch from MSSQL to Oracle.

I could be wrong on any of these, considering I have to use Oracle daily at work I'm more than happy to be wrong and learn something since I can apply it immediately.

mike_hearn a month ago | parent

Thanks for the insight! Some of these things were new to me :)

I didn't mention audits and the like because OpenAI are in the cloud, and there are obviously no audits for managed databases. That's relevant for self-hosted setups.

Both options I specced out included a replica, iirc. At least that's what Microsoft's HA option appears to be (it doubles the price, indeed).

One of the problems in comparing with Postgres is that people tend to advertise mutually exclusive options, sometimes in different comments by different people, so it's not their fault. Anything that requires an extension might as well not exist for most Postgres users, because they want to outsource DB management to the cloud, and there the cloud provider chooses the extensions, not the user. Oracle doesn't have this problem because all the features are built in and managed databases have access to everything.

Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

Re: cluster waits. Multi-node clusters will inevitably have failure modes a single machine doesn't, unfortunately. I'd prefer to always scale up first before scaling out. One of the advantages of a managed database is that this becomes the cloud provider's problem.

Re: node list distribution. I think that's what the SCAN feature does, which is supported in the thin driver: https://docs.oracle.com/en/database/oracle/oracle-database/2...
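
For what it's worth, connecting through a SCAN name from the thin driver is just a URL change. A minimal sketch (the SCAN name, service and credentials are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ScanConnect {
        public static void main(String[] args) throws SQLException {
            // The SCAN name resolves to the cluster's listeners, so the client
            // never needs individual node addresses.
            try (Connection c = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//myscan.example.com:1521/app_svc",
                    "app_user", "app_password")) {
                System.out.println(c.getMetaData().getDatabaseProductVersion());
            }
        }
    }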

Yes, the advanced features are typically not exposed by generic middleware and you will have to change things to take advantage of them, but this is a general issue that affects all kinds of software. Features specific to the Linux kernel are often not exposed in Python or Java either, for instance. Some features are easily enabled by just dropping in UCP, which should be an easy upgrade from Hikari (a sketch below). Very new stuff like tx multiplexing requires more work by the app developer to exploit currently, but hopefully with time the frameworks will catch up. Point is, the support is there if you need it. Compared to the gymnastics OpenAI are going through, it's easy stuff.
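
To show what I mean by an easy upgrade, a hedged sketch of a UCP pool (the URL and credentials are placeholders; the API lives in oracle.ucp.jdbc):

    import java.sql.Connection;
    import java.sql.SQLException;
    import oracle.ucp.jdbc.PoolDataSource;
    import oracle.ucp.jdbc.PoolDataSourceFactory;

    public class UcpPool {
        public static void main(String[] args) throws SQLException {
            PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
            pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
            pds.setURL("jdbc:oracle:thin:@//myscan.example.com:1521/app_svc"); // placeholder
            pds.setUser("app_user");
            pds.setPassword("app_password");
            pds.setInitialPoolSize(5);
            pds.setMaxPoolSize(20);
            try (Connection c = pds.getConnection()) {
                System.out.println(c.getMetaData().getDatabaseProductName());
            }
        }
    }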

anarazel a month ago | parent | next

> Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

Postgres' SSI [1] does not block reads. If you have lots of transactions reading and updating a lot of rows, the granularity of tracking will become coarser to keep memory usage in bounds, though.

[1] https://drkp.net/papers/ssi-vldb12.pdf

mike_hearn a month ago | parent

You're right, I mis-remembered - Postgres calls them "SIREAD locks" but they aren't actually locks. The model is based on aborting transactions.

The issue with this - and I'm not saying it's a bad feature, because it's not, and I'd like to have support for every isolation level everywhere - is that most apps can't tolerate transaction aborts. They need to be written to expect aborts and loop (a sketch below). Looping can easily turn into livelock if you aren't careful, and there aren't great tools or known best practices for handling such a system.
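
The shape of that loop, as a minimal JDBC sketch (the transaction body is a placeholder; 40001 is Postgres' serialization_failure SQLSTATE):

    import java.sql.Connection;
    import java.sql.SQLException;

    public class SerializableRetry {
        static final int MAX_ATTEMPTS = 5;

        // Re-runs the transaction body when the database aborts it with a
        // serialization failure.
        static void runWithRetry(Connection conn, TxWork work) throws SQLException {
            for (int attempt = 1; ; attempt++) {
                try {
                    conn.setAutoCommit(false);
                    work.run(conn);
                    conn.commit();
                    return;
                } catch (SQLException e) {
                    conn.rollback();
                    if (!"40001".equals(e.getSQLState()) || attempt >= MAX_ATTEMPTS) {
                        throw e; // a real error, or we give up
                    }
                    // Without randomized backoff, hot contention can livelock.
                    try {
                        Thread.sleep((long) (Math.random() * 50 * attempt));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }

        interface TxWork { void run(Connection conn) throws SQLException; }
    }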

You also have to be careful because such transactions are only strictly serializable vs other transactions run at the same isolation level. So if you accidentally allow in a mix of transactions you are still exposed to isolation anomalies.

So it's a useful feature but not something most apps can easily port to.

jose_zap a month ago | parent | prev | next

> Postgres offers strictly serializable isolation indeed, but IIUC it's basically a form of read locking so will tank performance.

It will not. We use it at work for all transactions and it is very performant.

joshsm5 a month ago | parent | prev

Sorry, I was on holiday. I was referring to having a separate region for failover, which is called "Geo redundant backup" in Azure Database for Postgres, whereas with Exadata you need an entirely separate Exadata rack in a different region, with the licensing of Data Guard on top of it.

There are a lot of extensions that most cloud providers support; auto range partitioning isn't built into Postgres like it is in Oracle, but it's supported with an extension, for example. Azure Database supports more than AWS, last I checked. https://learn.microsoft.com/en-us/azure/postgresql/extension...

I will have to check with our DBAs whether we use the SCAN feature or not; appreciate you pointing it out.