Remix.run Logo
wvbdmp 16 hours ago

Eh, sometimes you need a quick fix and it’s just extremely concise and readable. I’ll take an INNER JOIN over EXISTS (nice but insanely verbose) or CROSS APPLY (nice but slow) almost every time. Obviously you have to know what you’re dealing with, and I’m mostly talking about reporting, not perf critical application code.

Distinct is also easily explained to users, who are probably familiar with Excel’s “remove duplicate rows”.

It can also be great for exploring unfamiliar databases. I ask applicants to find stuff in a database they would never see by scrolling, and you’d be surprised how many don’t find it.

Sesse__ 16 hours ago | parent [-]

The less verbose way of doing semijoins is by an IN subquery.

wvbdmp 16 hours ago | parent [-]

>subquery

>less verbose

Well…

In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.

Sesse__ 15 hours ago | parent [-]

What do you mean? Here are your real alternatives for doing a semijoin (assuming ANSI SQL, no vendor extensions):

  SELECT * FROM t1 WHERE EXISTS ( SELECT * FROM t2 WHERE t2.x = t1.x );
  SELECT * FROM t1 WHERE x IN ( SELECT x FROM t2 );
  SELECT * FROM t1 JOIN ( SELECT DISTINCT x FROM t2 ) s1 USING (x);
Now tell me which one of these is the less verbose semijoin?

You could argue that you could fake a semijoin using

  SELECT DISTINCT * FROM t1 JOIN t2 USING (x);
or

  SELECT * FROM t1 JOIN t2 USING (x) GROUP BY t1.*;
but it doesn't give the same result if t1 has duplicate rows, or if there is more than one t2 matching t1. (You can try to fudge it by replacing * with something else, in which case the problem just moves around, since “duplicate rows” will mean something else.)
wvbdmp 14 hours ago | parent [-]

No, sorry, you’re certainly correct, I just meant that any subqueries are generally crazy verbose. And then you usually want additional Where clauses or even Joins in there, and it starts to stop looking like a Where clause, so I’m often happy when I can push that logic into From.

Sesse__ 14 hours ago | parent [-]

Yes, I would certainly prefer if you could write

SELECT * FROM t1 SEMIJOIN t2 USING (x);

although it creates some extra problems for the join optimizer.

Little_Kitty 12 hours ago | parent [-]

It's great being able to use an any join (and the counterpart anti join) in Clickhouse to deal with these operations.