da_chicken 2 days ago

I still see value in the numbering.

Breaking 1NF is essentially always incorrect. You're fundamentally limiting your system and making it so that certain queries become a struggle. Only break 1NF when you're absolutely certain that nobody, anywhere, will ever need to do anything even slightly complex with the data in question. And even then, probably apply 1NF anyway. Everyone who ever has to use your system is going to hate you when they find this table, because you didn't think of the situation they're interested in. "Why does this query use 12 CTEs and random functions I've never heard of, and take 5 minutes to return 20,000 rows?" "You broke 1NF."
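As a sketch of what that pain looks like (hypothetical `orders`/`order_tags` tables, MySQL syntax): cramming multiple values into one column forces string matching where the 1NF design would be a plain, indexable join.

    -- Broken 1NF: multiple tags crammed into one comma-separated column.
    -- Finding rows with a given tag means string functions, not a join:
    SELECT order_id
    FROM orders
    WHERE FIND_IN_SET('expedited', tags) > 0;  -- can't use an index on tags

    -- The 1NF version is a trivial, indexable join:
    SELECT o.order_id
    FROM orders o
    JOIN order_tags t ON t.order_id = o.order_id
    WHERE t.tag = 'expedited';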

2NF is usually incorrect to break. It's going to be pretty obnoxious to re-normalize your data using query logic, but the need won't come up nearly as frequently. If it's really not going to come up often in practical terms, then okay.

3NF and BCNF are nice to maintain, but the circumstances where they're just not practical or necessary start to feel pretty common. Further, the query complexity needed to undo the denormalization won't be as obnoxious as it is for 1NF or 2NF. But if you can do it, you probably should normalize to here.

4NF and higher continue along the same lines, but they increasingly impose what feel like pretty arbitrary requirements, or situations where the cost you're paying in indexes becomes higher than the relational-algebra benefits. Your database's disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.

sgarland 2 days ago | parent [-]

> Your database disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.

FK constraints add a negligible amount of space, if any. The indexes they require do, certainly, but presumably you're already doing joins on those FKs, so they should already be indexed.

Junction tables are how you represent M:N relationships. If you don't have them, you're either storing multiple values in an array (which, depending on your POV, may or may not violate 1NF), or you have a denormalized wide table with multiple attributes, some of which are almost certainly NULL.

Also, these all serve to prevent various forms of data anomalies. Databases must be correct above all else; if they're fast but wrong, they're useless.

da_chicken 2 days ago | parent [-]

> Junction tables are how you represent M:N relationships.

Yeah, the problem is that when you get to 4NF+, you're often creating a new table joined through a junction table for a single multi-valued field that may hold a single value a plurality or even a majority of the time. So you need the base table, the junction table with at least two columns, and the actual data table.

So, you've added two tables, two foreign key constraints, two primary key indexes, potentially more non-clustered indexes... and any query means you need two joins. And data validation is hard because you need to use an anti-join to find missing data.
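The anti-join in question might look like this (hypothetical `person`/`person_phone` tables): finding base rows that have no entry through the junction takes a `NOT EXISTS` (or an equivalent `LEFT JOIN ... IS NULL`) rather than a simple predicate.

    -- Validation via anti-join: find people with no phone on record.
    SELECT p.person_id
    FROM person p
    WHERE NOT EXISTS (
        SELECT 1
        FROM person_phone pp
        WHERE pp.person_id = p.person_id
    );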

Or, you can go with a 1:N relationship. Now you have only one more table, at the cost of potentially duplicating values between entities. But if we're talking about, say, telephone numbers? Sure, different entities might share the same phone number. Do you need a junction table so you don't duplicate a phone number? You're certainly not saving disk space or improving performance by doing that unless there are regularly dozens of individual records associated with a single phone number.
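That 1:N alternative is one extra table and one join (a sketch with hypothetical names; duplicates across people are simply allowed):

    -- One row per (person, phone); the same number may appear
    -- under multiple people, and that's fine.
    CREATE TABLE person_phone (
        person_id INT NOT NULL,
        phone VARCHAR(32) NOT NULL,
        PRIMARY KEY (person_id, phone),
        FOREIGN KEY (person_id) REFERENCES person (id)
    );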

And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.

> Databases must be correct above all else; if they're fast but wrong, they're useless.

Yeah, and if they're "correct" but you can't get them to return data in a timely manner, they're also useless. A database that's a black hole is not an improvement. If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. And if 18 of those joins are there to describe fidelity that you don't even need?

mickeyp 2 days ago | parent | next [-]

Right. But faceting data is also part of what a good database designer does. That includes views over the data; materialisation, if it is justified; stored procedures and cursors.

I've never had to do 18 joins to extract information in my career. I'm sure these cases legitimately exist, but they are of course rare, even in large enterprises. Most companies are more than capable of distinguishing OLTP from OLAP and real-time from batch, and of designing (or redesigning) accordingly.

Databases and their designs shift with the use case.

da_chicken a day ago | parent [-]

> I've never had to do 18 joins to extract information in my career.

Really? You're not representing particularly complex entities with your data.

I work on a student information system. 18 joins isn't even weird. If I want a list of the active students, the building they're in, and their current grade level, that's a join of 8 tables right there. If I also want their class list, that's an additional 5 or 6. If you also want the primary teacher, add another 4. If you want secondary staff, that's another 5.

The whole system is only around 500 GB, but it's close to 2,000 tables. Part of the reason is tech debt and archaic design from the vendor, but that's just as likely to reduce the number of tables as it is to increase them. The system uses a monolithic lookup table design, and some of the tables have over 300 columns. If they were to actually properly normalize the entire system to 3NF, I have no doubt that it would be in the hundreds of thousands of tables.

sgarland 2 days ago | parent | prev [-]

> joining through a junction table for a single multi-valued data field

I may be misunderstanding you, but it sounds like you're conflating domain modeling with schema modeling. If your domain is like most SaaS apps, then Phone, Email, Address, etc. are probably all attributes of a User, and are 1:N. The fact that multiple Users may share an Address (either from multiple people living together, or from people moving) doesn't inherently mean you have an M:N relationship that you must model in the schema.

If you were using one of those attributes as an identity (e.g. looking up a customer by their phone number), that still doesn't automatically mean you have to model everything as M:N. You could accept the possibility of duplicates that you handle in application code or by a human, or you could create a UNIQUE constraint that makes sense for 99% of your users (e.g. `(phone_number, deactivated_at)` enforces that a phone number is only assigned to one active user at a time) and find another way to handle the rare exceptions. In both cases, you're modeling the schema after your business logic, which is IMO the correct way to do it.
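One way to express that "one active user per phone number" rule is a partial unique index (Postgres syntax here, with a hypothetical `users` table, since a partial index handles NULLs cleanly; note that in MySQL a plain composite UNIQUE treats NULLs as distinct, so it wouldn't fully enforce this on its own):

    -- At most one active (deactivated_at IS NULL) user per phone number;
    -- deactivated users don't count against the constraint.
    CREATE UNIQUE INDEX uniq_active_phone
        ON users (phone_number)
        WHERE deactivated_at IS NULL;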

I apologize if I came across as implying that any possible edge case means that you must change your schema to handle it. That is not my design philosophy. The schema model should rigidly enforce your domain model, and if your domain model says that a User has 0+ PhoneNumber, then you should design for 1:N.

> And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.

If the attribute is mostly 1:1, then whether or not you should decompose it largely comes down to semantic clarity, performance, and the possibility of expansion.

This table is in 3NF (and BCNF, and 4NF):

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone VARCHAR(32) NULL
    );
So is this:

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone_1 VARCHAR(32) NULL,
      phone_2 VARCHAR(32) NULL
    );
Whereas this may violate 3NF depending on how you define a Phone in your domain:

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone_1 VARCHAR(32) NULL,
      phone_1_type ENUM('HOME', 'CELL', 'WORK') NULL,
      phone_2 VARCHAR(32) NULL,
      phone_2_type ENUM('HOME', 'CELL', 'WORK') NULL
    );
If a Phone is still an attribute of a User, and you're not trying to model the Phone as its own entity, then arguably `phone_1_type` is describing how the User uses it (I personally think this is a bit of a stretch). Similarly, it can be argued that this design violates 1NF, because `(phone_n, phone_n_type)` is a repeating group, even if you've split it out into two columns. Either way, I think it's a bad design (adding two more columns that will be NULL for most users to support a tiny minority isn't great, and the problem compounds over time).
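For contrast, the decomposed 1:N version of the same data removes the repeating group entirely (a sketch, keeping the MySQL-style DDL from above; the `UserPhone` name is hypothetical):

    -- One row per (user, phone); no NULL phone columns on User,
    -- and adding a third phone requires no schema change.
    CREATE TABLE UserPhone (
        user_id INT NOT NULL,
        phone VARCHAR(32) NOT NULL,
        phone_type ENUM('HOME', 'CELL', 'WORK') NOT NULL,
        PRIMARY KEY (user_id, phone),
        FOREIGN KEY (user_id) REFERENCES User (id)
    );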

> If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?

The only times I've seen anything close to that many joins are:

1. Recreating a denormalized table from disparate sources (which are themselves often not well-constructed) to demonstrate that it's possible.

2. Doing some kinds of queries in MySQL <= 5.7 on tables modeling hierarchical data using an adjacency list, because it doesn't have CTEs.

3. When product says "what if we now supported <wildly different feature from anything currently offered>" and the schema was in no way designed to support that.

Even with the last one, I think the most I saw was 12, which was serendipitous because it's the default `geqo_threshold` for Postgres.