Remix.run Logo
torginus 5 days ago

Are you sure about this?

Let's say you run a webshop and have two tables, one for orders with 5 fields, one for customers, with 20 fields.

Let's say you have 10k customers, and 1m orders.

A query performing a full join on this and getting all the data would result in 25 million fields transmitted, while 2 separate queries and a client side manual join would be just 5m for orders, and 200k for customers.

jameshart 5 days ago | parent | next [-]

If you need all the orders and all the customers sure.

But usually you need some of the orders and you need the customer info associated with them. Often the set of orders you’re interested in might even be filtered by attributes of the customers they belong to.

The decision of whether to normalize our results of a database query into separate sets of orders and customers, or to return a single joined dataset of orders with customer data attached, is completely orthogonal to the decision of whether to join data in the database.

bambax 5 days ago | parent | prev | next [-]

One way to think about 1-to-many relationships is to think in the other way, "many-to-one". You don't join the orders to the customers, you join the customers to the orders (enrich the orders with customer information).

It's very natural to want customer information when querying an order, and if you have a view like orders_with_customer_info, you get that with zero effort when querying that view by order id.

You also get consolidated data (orders by customer) by doing

  select count(*), sum(amount) from orders_with_customer_info group by customer_id
which I think is pretty straightforward.
5 days ago | parent | prev | next [-]
[deleted]
digitalPhonix 5 days ago | parent | prev | next [-]

What sort of application is regularly doing a query for “all data”?

aembleton 5 days ago | parent [-]

Client report generation.

tremon 5 days ago | parent | next [-]

1) As soon as reporting requirements get serious, you build a data warehouse. Because odds are, the client will want to combine data from multiple systems in their reports anyway. If not today, then they will tomorrow.

2) such reports never need all the data, it's mostly about top N volume queries or month-over-month performance data. When a reporting application does query all the data, it's because it's building its own data warehouse so the query usually happens only once per day, at a specific time, which means the load is entirely predictable.

digitalPhonix 5 days ago | parent | prev [-]

Even then, DB side aggregation and joins would be beneficial.

I’m asking specifically about “all data” unfiltered, un-aggregated, un-joined as the parent’s analysis was on.

nicoburns 5 days ago | parent | prev | next [-]

These days you can use JSON aggregation in the database to avoid returning duplicate data in what would otherwise be large joins.

dondraper36 5 days ago | parent | prev | next [-]

What I particularly like about the comments in this thread is how it proves that everything is a trade-off :)

valiant55 5 days ago | parent | prev [-]

My rule of thumb is if it's a 1:1 relationship, use a join. If it's 1:M, separate queries.