Things you didn't know about indexes
38 points by bugsmith
Used PostgreSQL for years and somehow never bumped into INCLUDE. This was the first explanation that made it click: columns in the index proper affect ordering and have to be maintained on writes; INCLUDE columns are just there to cover the query.
It's pretty new to me too, and I'm a long-time Postgres user (though far from being any kind of true database engineer).
I have rarely been fortunate enough to be able to take advantage of index only scans. INCLUDE might make it more viable if it doesn't blow up the size of the index too much.
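For anyone who hasn't seen the syntax, a minimal sketch of what's being discussed (table and column names here are made up):

```sql
-- Only customer_id participates in the B-tree ordering; status is
-- just stored in the leaf pages so it can cover reads.
CREATE INDEX orders_customer_idx
    ON orders (customer_id) INCLUDE (status);

-- With a reasonably fresh visibility map, this can be satisfied by
-- an index-only scan, since both columns live in the index:
SELECT customer_id, status FROM orders WHERE customer_id = 42;
```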
It's truly amazing how deep a tool like Postgres is. I still learn new things about it weekly. This is why I don't believe applications can ever swap out dependencies like this; their semantics bleed through to everything.
I recently learned you can also INCLUDE additional columns in the primary key index.
Example of syntax:
    CREATE TABLE invoice (
        PRIMARY KEY (invoice) INCLUDE (customer, status),
        invoice bigint GENERATED ALWAYS AS IDENTITY,
        customer bigint NOT NULL,
        status text NOT NULL,
        amount numeric NOT NULL
    );
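Given that definition, a point lookup like the one below can in principle be served entirely from the primary-key index, since both extra columns are stored in its leaf pages (a sketch, assuming the table above):

```sql
-- Can use an index-only scan on the primary-key index without
-- touching the heap, because customer and status are INCLUDEd:
SELECT customer, status FROM invoice WHERE invoice = 1;
```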
FYI almost all of this is applicable to SQLite too, except
That's interesting. I knew some of this also applied to SQLite (the only other database I regularly interact with) but wasn't sure to what extent.
Databases also keep metadata on how many times each index has been used. It's a reasonable maintenance task to periodically go through indexes that are more than a few months old and drop the ones that are never used. Indexes are not free, but it's often better to create a bunch up front and delete the unused ones later than to have to build a new index on a really big table after the fact.
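In Postgres that metadata lives in pg_stat_user_indexes. A sketch of the kind of query you might run to find candidates (note the stats reset whenever statistics are cleared, so check the counters have been accumulating long enough):

```sql
-- Indexes never used for a scan since stats were last reset,
-- excluding those that enforce constraints, which must stay:
SELECT s.schemaname, s.relname, s.indexrelname, s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary;
```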
I always wanted a way for postgres to warn you if it is not going to be able to do a covering index for a table. I don't know if it's possible or what it would take.
I'm still convinced that the many people who learn data structures at university would be confronted with all of this immediately if they had to do the query planning themselves when writing queries.
People have an intuitive understanding of how trees work, and being presented with that would make it more or less immediately obvious that certain things will or won't work!
I do realize the query planner has saved a lot of people a lot of pain, but when you're faced with a slow, complex query... being able to say "well, let's make the query do exactly this" would be nice, and would lean into knowledge a lot of people working on these systems already have.
(My understanding is that relational DBs used to force you to query plan and SQL was a response to that... but.... this time would be different maybe?)
> My understanding is that relational DBs used to force you to query plan and SQL was a response to that... but.... this time would be different maybe?
It’s surprising to me that the ideas of relational algebra and SQL as a declarative interface to a database grew up together: there was not an intermediate stage where relations (as an abstract data model) were a thing, but SQL had not yet been invented.
I haven’t seen a good description of what OLTP was like before SQL. There wasn’t very much OLTP in the 1960s or 1970s: most data processing at that time was based on batch jobs, with evolutionary continuity back to the kinds of electromechanical punch card processing that happened in the first half of the 1900s. (Batch jobs are why retail payments took days to clear until well into this century. OLTP is a radical innovation that took decades to become normal; the web was the first thing that made OLTP accessible to everyone and now we expect retail payments to clear in seconds.)
The most prominent example of early OLTP was SABRE but it’s so old it must have been entirely bespoke.
Another place to look is the old IBM "access methods", which were incredibly low-level ways of laying out data on raw storage devices, without even a filesystem. For instance, I think ISAM is something like a table stored as a B-tree ordered by its primary key.
As I understand it, if you were going to write a join in the 1970s, you would have to write, long hand, a nested loop iterating over the two tables, and if the representation of the tables changed (eg, from unsorted records to btree layout) you would have to rewrite the queries. There was no abstraction layer between data layout and querying. If there was a query plan it was a flow chart drawn on paper as part of the design and planning before the code was written.
The only way would simply be to analyze the query. Maybe there are PG monitoring tools out there that analyze all queries?
Yeah, but why wait until after the fact when you could tell Postgres up front that you expect a covering index and have it tell you when that won't hold? Maybe that's not possible with dynamically sized data, where whether a scan stays index-only can depend on the data itself, but outside that case I feel like it should be possible. And it seems like a reasonable use case: you have fixed-size fields in a table and expect a covering index.
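There's no up-front declaration today, but the after-the-fact check is at least cheap: EXPLAIN shows whether the planner chose an index-only scan (a sketch; the table here is hypothetical):

```sql
-- If the index covers the query, the plan shows "Index Only Scan";
-- a plain "Index Scan" means the heap still has to be visited.
EXPLAIN SELECT customer, status FROM invoice WHERE invoice = 1;
```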
With the exception of INCLUDE, this should all be fairly common knowledge for people who use databases regularly. I suppose it's good to see similar articles every few years so that new people get up to speed (pun intended).
I expected to see bitmap indexes mentioned, but then I saw that the article has a Postgres slant, so it makes sense that they were left out since Postgres doesn't support them. I found them great for specific cases of very large tables in the past, and I miss them since I switched to mostly PostgreSQL.