Joins are NOT Expensive

34 points by chenghiz


stig

I haven’t even read the article, but the title triggered me. Again and again I have seen developers do two or more SQL queries and then a sort of weird join in code, rather than using a simple SQL join. Reducing the number of roundtrips to the DB is one of the crosses I bear, apparently.
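The pattern described above can be sketched with a hypothetical users/orders schema (table and column names are made up for illustration): two round trips plus an application-side join, versus one query that lets the database do the work.

```python
import sqlite3

# Hypothetical schema, purely illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 3.0), (12, 2, 7.25);
""")

# The anti-pattern: two round trips, then a join reimplemented in code.
users = db.execute("SELECT id, name FROM users").fetchall()
orders = db.execute("SELECT user_id, total FROM orders").fetchall()
by_user = {}
for user_id, total in orders:
    by_user.setdefault(user_id, []).append(total)
app_side = [(name, sum(by_user.get(uid, []))) for uid, name in users]

# One round trip: the database joins and aggregates.
sql_side = db.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
""").fetchall()

print(app_side)  # [('ada', 12.5), ('bob', 7.25)]
print(sql_side)  # same rows, one query
```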

jonathannen

"Joins are NOT expensive" is doing a lot of heavy lifting as a title. This is a conditional claim dressed up as a universal one.

They’re cheap when they shrink the problem and brutal when they expand it. "Joins are often cheaper than people think" might be better.
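The shrink-vs-expand point shows up directly in output cardinality. A hypothetical sketch: a 10,000-row tag table self-joined on a non-unique key fans out to 10 × 10 rows per post.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (post_id INTEGER, tag TEXT)")

# Illustrative data: 1,000 posts, each with 10 tags (10,000 rows total).
db.executemany("INSERT INTO tags VALUES (?, ?)",
               [(p, f"t{t}") for p in range(1000) for t in range(10)])

# An expanding join: pairing every tag of a post with every other tag
# of the same post yields 10 * 10 rows per post -- 100,000 output rows
# from a 10,000-row input. The join itself isn't slow; the output is big.
n, = db.execute(
    "SELECT COUNT(*) FROM tags a JOIN tags b ON a.post_id = b.post_id"
).fetchone()
print(n)  # 100000
```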

marginalia

Indexes can typically be intersected in sublinear time, so I don't know why it would be slow.
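One way to see the sublinear intersection: if two sorted posting lists of row ids have lengths m and n with m much smaller than n, binary-searching the larger list for each element of the smaller costs O(m log n). A minimal sketch (the posting lists here are made up):

```python
from bisect import bisect_left

def intersect(small, large):
    """Intersect two sorted posting lists of row ids.

    Binary-searching the larger list for each element of the
    smaller one costs O(m log n) -- sublinear in n when m << n.
    """
    out = []
    for x in small:
        i = bisect_left(large, x)
        if i < len(large) and large[i] == x:
            out.append(x)
    return out

# Hypothetical postings for two indexed predicates.
open_rows = [3, 8, 15, 22, 41, 57, 90]       # rows matching predicate A
user_rows = list(range(0, 100, 3))           # rows matching predicate B
print(intersect(open_rows, user_rows))       # [3, 15, 57, 90]
```

Real engines refine this further (galloping search, skip pointers), but the shape of the cost is the same.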

srcreigh

The chart looks a lot worse than O(n log n). Is DuckDB supposed to have O(n^2) runtime for the query select c1, ..., cn from tbl? Something seems off.

radim

While this really goes into the details, this IS THE THING. I know software architects who would do anything to avoid a JOIN. Trying to reinvent things that have been working for decades...

viega

I also do not like the title, because it's not true in a broader sense. Sure, if the data set is large enough, and you're not leaving the relational paradigm behind, you might find this to be true. But joins are a generic, heavyweight way to correlate and coalesce two data sets. When data gets large, it tends to be better to use something more suited to the problem, and have purpose-built code for correlating and coalescing that is tailored to the case.

That often motivates moving to a different kind of backing store, like a columnar database. But then those things will inevitably add a SQL interface, and end up with the same single sledgehammer for all your hammering needs.

dpc_pw

I think this is a case where it's important for developers to understand what actually happens under the hood, because that is simpler than trying to remember and reason about myriad observable outside effects (performance). A simplified model of a table and key/index structure is all that's needed to explain 99% of database behaviors, and it takes less space to state than this article alone.

Seriously, if you want to up your SWE skills quite a lot with not a lot of effort, talk with your LLM about how databases work under the hood, how each thing is stored, how the operations are implemented, and maybe implement the simplest possible one yourself. Not a whole SQL parser and engine, but something that can store a few tables, even hardcoded, make a join, and implement a couple of typical database operations on it: insert, update, delete, index lookup, index update, selects via join, etc. For a more structured approach, Designing Data-Intensive Applications is the go-to book.
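In the spirit of the exercise suggested above, here is a minimal in-memory hash equi-join over hardcoded tuples (names and data are illustrative, not from any real engine):

```python
def hash_join(left, right, left_key, right_key):
    """Minimal in-memory hash equi-join.

    Build a hash table over one input, then probe it with each row of
    the other: O(|left| + |right|) on average, independent of the
    product of the table sizes -- the reason equi-joins scale well.
    """
    table = {}
    for row in left:
        table.setdefault(left_key(row), []).append(row)
    return [l + r for r in right for l in table.get(right_key(r), [])]

# Hardcoded "tables" as tuples.
users = [(1, "ada"), (2, "bob")]
orders = [(10, 1), (11, 1), (12, 2)]
joined = hash_join(users, orders,
                   left_key=lambda u: u[0],    # users.id
                   right_key=lambda o: o[1])   # orders.user_id
print(joined)
# [(1, 'ada', 10, 1), (1, 'ada', 11, 1), (2, 'bob', 12, 2)]
```

From there it's a small step to add a nested-loop variant and compare the two, which makes the cost model in this article much more concrete.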

rudis

That was really interesting, and it made me question some received wisdom: that for data warehousing, of course you denormalize into big tables to avoid joins. I'll make sure to measure and evaluate if I ever need to design a database schema for analytical processing!