KronisLV 6 days ago

> Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.

I’m surprised that the drawbacks of EAV (entity-attribute-value modeling) or just using JSON in your relational database don’t get called out more.

I’d much rather have 20 tables, each with a clear purpose, than discover that colleagues have once again built a “classifier” mechanism using polymorphic links (no actual foreign keys, just columns like “section” and “entity_id”) and are treating it as a grab bag of stuff, one that you need to read a pile of application code to even hope to understand.
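
To make that concrete, here’s a sketch of the shape I mean (Postgres syntax, hypothetical names):

    -- The grab-bag "classifier": polymorphic links, no real foreign keys.
    -- "section" tells you which table entity_id supposedly points at.
    CREATE TABLE classifier (
        id        bigint PRIMARY KEY,
        section   text   NOT NULL,  -- 'invoice', 'customer', 'order', ...
        entity_id bigint NOT NULL,  -- nothing enforces that this row exists
        key       text   NOT NULL,
        value     text              -- everything squeezed into text
    );

    -- Versus a narrow table per concept, with links the database enforces:
    CREATE TABLE invoice_tags (
        invoice_id bigint NOT NULL REFERENCES invoices (id),
        tag        text   NOT NULL,
        PRIMARY KEY (invoice_id, tag)
    );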

Whenever I see that, I want to change careers. I get that EAV has its use cases, but in most other cases fuck EAV.

It’s right up there with N+1 query issues; complex dynamically generated SQL where views would suffice; storing audit data in the same DB until functionality inevitably gets written against it and your audit data becomes part of the business logic; shared database instances with no easy way to bootstrap your own; working with Oracle in general; and putting things in the DB that’d be better off in the app, and vice versa.
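
On the views point, the contrast is roughly this (hypothetical names): instead of the application gluing SQL strings together, the query gets a name in the database.

    -- Defined once, in the DB:
    CREATE VIEW overdue_invoices AS
    SELECT i.id, i.customer_id, i.due_date, i.total
    FROM invoices i
    WHERE i.paid_at IS NULL
      AND i.due_date < current_date;

    -- Application code then reads like the business rule it implements:
    -- SELECT * FROM overdue_invoices WHERE customer_id = $1;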

There are so many ways to decrease your quality of life when it comes to storing and accessing data.

dondraper36 6 days ago

There's a great book, SQL Antipatterns by Bill Karwin, that discusses and criticizes this specific antipattern.

That said, sometimes when I realize there's no way to come up with even a rough schema (say, some settings object that is returned to the frontend), I use JSONB columns in Postgres. As a rule of thumb, however, if something can be normalized, it should be: despite all the JSON(B) conveniences and optimizations, Postgres is still a relational database.
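
As a sketch of where I'd draw that line (hypothetical names; the GIN index is just one option for keeping JSONB containment queries usable):

    -- Stable, queryable attributes get real columns; only the genuinely
    -- free-form part (the frontend-defined settings object) stays JSONB.
    CREATE TABLE user_settings (
        user_id     bigint PRIMARY KEY REFERENCES users (id),
        theme       text   NOT NULL DEFAULT 'light',
        preferences jsonb  NOT NULL DEFAULT '{}'
    );

    CREATE INDEX user_settings_prefs_idx
        ON user_settings USING gin (preferences);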

quibono 5 days ago

> storing audit data in the same DB until functionality inevitably gets written against it and your audit data becomes part of the business logic

What's the "proper" way to do this? Separate DB? Separate data store?

KronisLV 5 days ago

Typically you want your audit/log data to be immutable and kept in an append-only data store.

Whether that's a typical relational DB or something more specialized (like a log-shipping solution) is up to you, but usually it would be separate from the main DB.

If you need some functionality that depends on events that have taken place, you probably want to store information about those events in the main data store (but only what's needed for that functionality, not the full list of mutations to a table that audit data might include).
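
For example (hypothetical schema): if the login flow needs to know when a password last changed, store that fact directly rather than having business logic dig through the audit trail.

    -- A derived fact the business logic needs, kept in the main store:
    ALTER TABLE users ADD COLUMN last_password_change_at timestamptz;

    -- ...instead of application code scanning the audit data for it:
    -- SELECT max(changed_at) FROM audit_log
    -- WHERE table_name = 'users' AND column_name = 'password_hash' AND ...;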

In general, it's nice to have a clear boundary between where the business domain ends and where the auxiliary stuff that keeps it running goes: your logs and audit data, analytics and metrics, tracing spans and so on.

Edit: as a critique of my own argument, I'll admit that the above can introduce some complexity, and in simpler systems it might be overkill. But I've also seen what happens when everything lives in one huge DB instance, about 90% of the overall schema size is literally records in those audit tables, and everyone is surprised that opening the "History" tab for a record takes a while (along with anything else that references that history, e.g. the visibility of related records). That's not great either.

zbentley 5 days ago

I was also on board with GP’s comment until I got to this part.

Audit data in the same DB is great, because it can be written transactionally relatively cheaply (multi-table updates, triggers, actual transactions with multiple writes, etc.).
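
A minimal sketch of the trigger variant (Postgres, hypothetical names); the audit row commits or rolls back together with the change itself:

    CREATE TABLE invoice_audit (
        id         bigserial   PRIMARY KEY,
        invoice_id bigint      NOT NULL,
        changed_at timestamptz NOT NULL DEFAULT now(),
        old_row    jsonb,
        new_row    jsonb
    );

    CREATE FUNCTION audit_invoices() RETURNS trigger AS $$
    BEGIN
        INSERT INTO invoice_audit (invoice_id, old_row, new_row)
        VALUES (NEW.id, to_jsonb(OLD), to_jsonb(NEW));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER invoices_audit
        AFTER UPDATE ON invoices
        FOR EACH ROW EXECUTE FUNCTION audit_invoices();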

After that, sure, ship it elsewhere and prune the audit tables if you like. But having the audit writes go directly to Kafka or whatnot is a pain, because it a) requires your client logic to perform a distributed publish-and-write transaction (which can work in this case more easily than distributed transactions in general, with careful use of idempotency keys, read-back, or transactional outboxes, but it’s complicated and requires everyone writing to auditable tables to play along), and b) reduces your reliability, because now the audit store or its message queue needs to be online for every write, not just your database.
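
The transactional-outbox version, sketched with hypothetical names: the business write and the event land in one transaction, and a separate relay process ships outbox rows to Kafka afterwards.

    BEGIN;
    UPDATE invoices SET status = 'paid' WHERE id = 42;
    INSERT INTO outbox (topic, payload)
    VALUES ('invoice-events', '{"invoice_id": 42, "event": "paid"}');
    COMMIT;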

And there are plenty of good reasons for business logic to use audit data (for reads only). What else would business logic do if an audit table existed and there was a business need to, e.g., show customers a change history for something? Build another, redundant audit system instead?

tremon 5 days ago

A separate schema with no read permissions for the application identity is sufficient. It's not like "separate DB" makes it magically unqueryable.
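
A minimal sketch of that setup (Postgres, hypothetical names): the application identity can append audit rows but never read them back, so nobody can quietly build features on top of them.

    CREATE SCHEMA audit;

    CREATE TABLE audit.changes (
        id         bigserial   PRIMARY KEY,
        table_name text        NOT NULL,
        row_id     bigint      NOT NULL,
        changed_at timestamptz NOT NULL DEFAULT now(),
        change     jsonb       NOT NULL
    );

    -- The app can append, and nothing else:
    GRANT USAGE  ON SCHEMA audit  TO app;
    GRANT INSERT ON audit.changes TO app;

    -- Reads go to a separate reporting/audit role:
    GRANT SELECT ON audit.changes TO auditor;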