Joins are NOT Expensive
34 points by chenghiz
I haven’t even read the article, but the title triggered me. Again and again I have seen developers do two or more SQL queries and then a sort of weird join in code, rather than using a simple SQL join. Reducing the number of roundtrips to the DB is one of the crosses I bear, apparently.
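The anti-pattern the comment describes can be sketched with a hypothetical two-table schema (the table and column names here are made up for illustration): two round trips plus a join in application code, versus one round trip that lets the database do the join.

```python
import sqlite3

# Hypothetical schema, purely to illustrate the pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 9.99), (11, 1, 5.00), (12, 2, 3.50);
""")

# Anti-pattern: two queries, then a "weird join" in application code.
users = {uid: name for uid, name in conn.execute("SELECT id, name FROM users")}
in_app = [(users[uid], total)
          for uid, total in conn.execute("SELECT user_id, total FROM orders")]

# Same result in a single round trip, letting the planner pick a join strategy.
joined = conn.execute("""
    SELECT u.name, o.total
    FROM orders o JOIN users u ON u.id = o.user_id
""").fetchall()

assert sorted(in_app) == sorted(joined)
```

Beyond the extra round trip, the in-application version also loads the entire `users` table into memory, which a selective SQL join would avoid.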
I wish it were the culture on here to flag comments that respond exclusively to clickbait titles of articles as off topic. It's lazy and very rarely generates any interesting discussion. On this article alone there are two top-level comments that have next to nothing to do with the contents of the article, and they are the most upvoted! Might as well have discussions about AI summaries of articles next.
a join in software does theoretically permit you to size the databases independently though.
This has been useful in some limited cases in my career, but it adds other (major) problems, like needing to somehow solve backup order precedence.
"Joins are NOT expensive" is doing a lot of heavy lifting as a title. This is a conditional claim dressed up as a universal one.
They’re cheap when they shrink the problem and brutal when they expand it. "Joins are often cheaper than people think" might be better.
Yeah, it's an interesting article that I appreciate reading, but I don't like the title because it leaves out all the caveats:
Contrary to what he says, I think a lot of people are running analytical queries on Postgres because it's what they have. Probably not the people who hire specialists on data warehouses, but most people aren't hiring those specialists.
Indexes can typically be intersected in sublinear time, so I don't know why it would be slow.
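The index intersection the comment refers to can be sketched as a merge over two sorted row-id lists, as produced by two separate index scans (a simplified model, not any particular database's implementation):

```python
def intersect_postings(a, b):
    """Merge-intersect two sorted row-id lists from two index scans.

    Runs in O(len(a) + len(b)) comparisons, which is sublinear in the
    table size whenever both indexes are selective.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

For example, `intersect_postings([2, 5, 7, 9], [1, 5, 9, 12])` yields `[5, 9]`: only rows matched by both predicates survive, without ever touching the rest of the table.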
The chart looks a lot worse than O(nlogn). Is DuckDB supposed to have O(n^2) runtime for the query select c1, ..., cn from tbl? Something seems off.
Yeah, I would expect a multiway zip to be O(nk) (num rows * num cols), so the upward superlinear curve in the chart suggests to me that DuckDB has a suboptimal algorithm. I wonder if it works better when rows are being filtered, though I struggle to imagine an algorithm that’s efficient at zipping a low-selectivity filter but inefficient at zipping a zero-selectivity lack-of-filter. Maybe it’s tuned for high-selectivity queries.
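The O(nk) expectation above can be made concrete with a minimal sketch of a multiway zip over a columnar layout (this is an idealized model, not DuckDB's actual row-materialization code):

```python
def zip_columns(columns):
    """Materialize n rows from k equal-length column arrays.

    Touches every value exactly once, so the cost is O(n * k):
    linear in the size of the output, with no superlinear term.
    """
    n = len(columns[0])
    return [tuple(col[i] for col in columns) for i in range(n)]
```

For instance, `zip_columns([[1, 2, 3], ["a", "b", "c"]])` produces `[(1, "a"), (2, "b"), (3, "c")]`; doubling either the row count or the column count should only double the work.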
While the article goes deep into the details, this IS THE THING. I know software architects who would do anything to avoid JOIN, trying to reinvent things that have been working for decades...
I also do not like the title, because it's not true in a broader sense. Sure, if the data set is small enough, and you're not leaving the relational paradigm behind, you might find this to be true. But joins are a generic, heavyweight way to correlate and coalesce two data sets. When data gets large, it tends to be better to use something more suited to the problem, and have purpose-built code for correlating and coalescing that is tailored to the case.
That often motivates moving to a different kind of backing store, like a columnar database. But then those things will inevitably add a SQL interface, and end up with the same single sledgehammer for all your hammering needs.
I think this is the case when it's important for developers to understand what actually happens under the hood, because it's simpler than trying to remember and reason about myriads of observable outside effects (performance). Simplified model of a table and key/index structure is all that is needed to explain 99% of database behaviors, and is shorter than just this article alone.
Seriously, if you want to up your SWE skills by quite a lot with not a lot of effort, talk with your LLM about how databases work under the hood: how each thing is stored, how operations are implemented, and maybe implement the simplest possible one yourself. Not a whole SQL parser and engine, but something that can store a few tables, even hardcoded, and implement a couple of typical database operations on them: insert, update, delete, index lookup, selects via index, a join, etc. For a more structured approach, Designing Data-Intensive Applications is the go-to book.
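In the spirit of the exercise the comment proposes, here is a toy version with hardcoded "tables" as lists of dicts and a hash join, one of the classic join strategies (all names are invented for illustration):

```python
# Two hardcoded toy "tables".
users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 9.99}, {"user_id": 2, "total": 3.50},
          {"user_id": 1, "total": 5.00}]

def hash_join(left, right, left_key, right_key):
    """Equi-join two tables by building a hash table on one side.

    Build a dict on `left`, then probe it once per `right` row:
    O(|left| + |right|), versus the nested loop's O(|left| * |right|).
    """
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [{**l, **r}
            for r in right
            for l in index.get(r[right_key], [])]

joined = hash_join(users, orders, "id", "user_id")
```

Implementing even this much makes it obvious why a join over indexed keys is cheap: the expensive part is the lookup structure, and the database already maintains one.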
That was really interesting, and made me question some received wisdom that for data warehousing, of course you denormalize into big tables to avoid joins. I'll make sure to measure and evaluate if I ever need to design a database schema for analytical processing!