Systematic design of multi-join GROUP BY queries
12 points by squadette
12 points by squadette
As far as “value multiplication” or duplicate rows, I’d say 90% (?) of the time I’ve made this mistake it’s either
t.id <> t2.id
instead of t.id < t2.id
The “overview” is quite good. I don’t see the problem with a having
in a subquery, as long as you know you’re inner join
ing it, but I might be misunderstanding. Implicit filtering is a really good term, I’ll have to use that. I usually think of it as inner join
s finally meeting left join
s.
Thank you!
I don’t see the problem with a having in a subquery
It may not always be a problem, but here is a scenario: <unique_key query> LEFT JOIN <subquery HAVING foo>”.
Imagine that <unique_key_query> returns 10 rows, and returns 3 rows. Due to LEFT JOIN between them, you will have 10 rows, but only 3 will have the correct number. Other 7 would have NULL, and if you apply COALESCE, you will probably have 0, which is wrong.
Now that I’m writing this, I suddenly see your point. If I read you correctly, you’re saying that we should just have
<unique_key_query> INNER JOIN <subquery HAVING foo>, instead of LEFT JOIN.
This will do the implicit filtering, and then you won’t need the outer WHERE. Right?
If yes, I agree. I need to rewrite this part, mentioning that you have to make sure that you use INNER JOIN in this case specifically, if that is your intention. Thank you,
My general concern is that people sometimes write slightly misleading constructs that just do not bite them in a specific situation, but then as they do some change, those constructs will suddenly wreak havoc.
Yeah I think you got it right. Maybe I just lean towards inner join
more, is all. As long as it all comes out the same, ok.
I think it’d matter more if we were actually filling a temp table w/ the having
-filtered rows (table would be smaller?), but if it’s just CTEs getting expanded it probably wouldn’t matter.
Actually since I published this and thought a bit more I went into full-scale Pepe Silvia mode and I think I understand why both people and ChatGPT tend to write multi-join queries by default. I’ll write a follow-up investigation, and hopefully would be able to better explain why you need to be aware of this framework even if your queries are not that complicated.
why both people and ChatGPT tend to write
I hope that’s not too off topic and I feel like I am stating the obvious, but ChatGPT does write what people write, since that’s what it’s trained on.
Yes, and that’s what makes it so useful. We can interrogate it and ask it why it thinks a certain way. It learned based on the same teaching corpus as people use to learn, so we can find flaws in the teaching corpus, and teach better.