Remix.run Logo
sgarland 2 days ago

> joining through a junction table for a single multi-valued data field

I may be misunderstanding you, but to me it sounds like you're conflating domain modeling with schema modeling. If your domain is like most SaaS apps, then Phone, Email, Address, etc. are probably all attributes of a User, and are 1:N. The fact that multiple Users may share an Address (either from multiple people living together, or people moving) doesn't inherently mean you have an M:N relationship that you must model with schema. If you were using one of those attributes as an identity (e.g. looking up a customer by their phone number), that still doesn't automatically mean you have to model everything as M:N - you could choose to accept the possibility of duplicates that you have to deal with in application code or by a human, or you could choose to create a UNIQUE constraint that makes sense for 99% of your users (e.g. `(phone_number, deactivated_at)` enforces that a phone number is only assigned to one active user at a time), and find another way to handle the rare exceptions. In both cases, you're modeling the schema after your business logic, which is IMO the correct way to do so.

I apologize if I came across as implying that any possible edge case means that you must change your schema to handle it. That is not my design philosophy. The schema model should rigidly enforce your domain model, and if your domain model says that a User has 0+ PhoneNumber, then you should design for 1:N.

> And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.

If the attribute is mostly 1:1, then whether or not you should decompose it largely comes down to semantic clarity, performance, and the possibility of expansion.

This table is in 3NF (and BCNF, and 4NF):

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone VARCHAR(32) NULL
    );
So is this:

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone_1 VARCHAR(32) NULL,
      phone_2 VARCHAR(32) NULL,
    );
Whereas this may violate 3NF depending on how you define a Phone in your domain:

    CREATE TABLE User (
      id INT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(254) NOT NULL,
      phone_1 VARCHAR(32) NULL,
      phone_1_type ENUM('HOME', 'CELL', 'WORK') NOT NULL,
      phone_2 VARCHAR(32) NULL,
      phone_2_type ENUM('HOME', 'CELL', 'WORK') NOT NULL,
    );
If a Phone is still an attribute of a User, and you're not trying to model the Phone as its own entity, then arguably `phone_1_type` is describing how the User uses it (I personally think this is a bit of a stretch). Similarly, it can be argued that this design violates 1NF, because `(phone_n, phone_n_type)` is a repeating group, even if you've split it out into two columns. Either way, I think it's a bad design (adding two more columns that will be NULL for most users to support a tiny minority isn't great, and the problem compounds over time).

> If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?

The only times I've seen anything close to that many joins are:

1. Recreating a denormalized table from disparate sources (which are themselves often not well-constructed) to demonstrate that it's possible. 2. Doing some kinds of queries in MySQL <= 5.7 on tables modeling hierarchical data using an adjacency list, because it doesn't have CTEs. 3. When product says "what if we now supported <wildly different feature from anything currently offered>" and the schema was in no way designed to support that.

Even with the last one, I think the most I saw was 12, which was serendipitous because it's the default `geqo_threshold` for Postgres.