Structured Primary Keys
30 points by giacomo_cavalieri
30 points by giacomo_cavalieri
I've often pushed myself towards this style in my side projects for similar reasons but this is nicely articulated. I'm particularly motivated by thinking like "make invalid data unrepresentable" so making it impossible to associate an order with a shipping address belonging to a different customer has always felt like a good starting place (if that matches the domain, of course). The performance upsides here were quite secondary and not something I'd thought deeply about, so I appreciate learning more from this article :)
I'm a bit of a database noob — took a course on SQL and worked with simple databases but nothing fancy. I don't understand the point of this article; it seems to carefully explain everything except the main point, which it just glances over: where does the performance benefit of the structured primary key come from?
The only thing I can think of, here, is that the select filters on customer_id, and since this is now the first part of the primary key of order_lines, no table scan of order_lines is necessary any more as there is an index we can use. This significantly reduces the number of redundant rows accessed and is thus good for performance. But then, the same performance benefit could have been got by adding an index on order_lines (or, indeed, on order) on (customer_id) or possibly more columns. So, it seems that the article assumes no additional indexes, which sounds odd to me.
But then the article goes:
With a structured primary key, the orders table does not need the id column at all. This often spares an index too.
So apparently we do implicitly assume some relevant indexes were created! But then that invalidates my understanding of the performance benefit.
Please help me: why is the query faster in the updated schema?
(Side note: because of the fetch first 1 row only, timings depend hugely on precisely where the first eligible row resides in the table, so the benchmark is suspect in my mind just for that!)
So apparently we do implicitly assume some relevant indexes were created! But then that invalidates my understanding of the performance benefit.
If you create a surrogate key, say an autogenerated numeric id column, to use as the primary key, then in many DB engines there is an implicit index on the id column. But you also need to create an explicit index on any column(s) you'll commonly use in a WHERE clause. Such as (customer_id, placed).
If the primary key is (customer_id, placed), as in the example, then you get an implicit index on that and don't have to create a separate explicit index for it.
There's one more important concept, which is when the index includes all the columns (from that table) that you're SELECTing, so the query can be served entirely out of the index.
To see how this works, suppose we take the orders table from the article, and we want to run a query that returns a list of all orders placed by a particular customer. And for ease of explanation, let's say this is Postgres, because I know it does things this way (several other major DB engines do too).
For our first version suppose we have the surrogate numeric id primary key, which has an implicit index, but no other indexes. We run the query:
SELECT id, customer_id, placed
FROM orders
WHERE customer_id = 123;
This is going to be slow, because there's no relevant index the database can use. So let's add an index on customer_id, because we probably use that a lot for filtering, and re-run the query. Now it'll be faster because the set of matching rows can be determined by scanning the index. And it can get the value for the customer_id for those rows from the index. But it still has to go to the table's storage (the table's "heap", which is not the same as "the heap" in programming languages which make you think about stack versus heap) to get the id and placed columns' values. And that can still be an expensive operation: it's likely to be random-access I/O since the set of columns we want are probably not stored adjacent to each other in the table's heap.
But now switch to the (customer_id, placed) primary key, with an index on those two columns, and the query becomes:
SELECT customer_id, placed
FROM orders
WHERE customer_id = 123;
We still are able to get all the filtering from an index scan because, once again, all the columns in the WHERE clause are in the index. But we also can get the full results of the query from the index, too, without having to go do random-access I/O to the table heap, because all the columns in the SELECT clause are also in the index. This is the fastest option of all.
Some sources will call this "index-only scan" (which is how I learned it) because the query only ever has to scan the index, and nothing else. Other sources call it a "covering index" because the index "covers" all the columns needed by both the WHERE and the SELECT.
But then, the same performance benefit could have been got by adding an index on order_lines (or, indeed, on order) on (customer_id) or possibly more columns.
Just because you have indices on two columns, that doesn't mean you can combine the indices in a single query.
The details depends a bit on the database. In some databases, the primary key decides the way rows are laid out on disk, while other indices are specialized tables that map the indexed column to a primary key, which is then used for lookup.
Since indices are tables off their own, you don't necessarily get any benefits from looking up an extra index.
Now, if you have a structured primary key, and you have a query that matches both part of the key, then finding the first part of the key will reduce the amount of rows required to scan through for the second part.
I believe PostgreSQL will, in most cases, try to pick one single index for it's queries.
I hope I was able to explain that in a good way 😅
Thank you! Your reply makes perfect sense by itself, but unfortunately I don't see how it applies to the post. :P
I get that indexes cannot necessarily be combined; I think of a query plan as a little program that often looks like a couple of nested "for loops". For the original query, assuming order_lines has an index on customer_id, we get: lookup customer_id in index on order_lines, scan in result set to find product_id, read order_id field, then primary key lookup in orders by ol.order_id. (If that PK lookup fails, continue scanning.) This doesn't feel bad to me. If you had a (customer_id, product_id) index on order_lines you could even shortcut the scan, but that doesn't feel worthwhile assuming orders are not gigantic.
For the revised (schema and) query, we get: lookup customer_id in prefix of PK index on order_lines, scan in result set to find product_id, read fields, then do PK lookup in orders. (Again, if PK lookup in orders fails, continue scanning.) This is identical to the previous case, just the orders PK is a little larger. Alternatively you could start in orders and lookup customer_id in its PK index, scan over results, lookup the corresponding order_lines entry each time, and check product_id; but this should be slower (more index lookups).
I don't see any index incompatibilities or indeed any relevant differences at all between the two cases. I'm surely missing something though!
The only thing the article gives as an explanation is that a structured PK maintains "cohesion among all rows that belong to the same customer." If you bring in the customers table too, I can see how this might increase parallelism in query execution, but not a 10-fold speedup; and the query here doesn't even use 'customers' at all!
It’s faster because everything you need is now in the index, so you don’t need to look it up in orders. Also order_lines will now be sorted by order_placed per customer, so it saves the sort step too.
The query planner is probably smart enough to realize that it doesn’t have to look up anything in orders, even when the query specifies a join.