The hidden cost of PostgreSQL arrays
28 points by sjamaan
When creating a table, you might expect strict typing. That's true for everything — except the array dimensions. You might think
integer[][] enforces a 2D matrix. Except it does not. The [] syntax is effectively syntactic sugar: PostgreSQL does not enforce the number of dimensions of sub-arrays at the schema level at all.
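A quick sketch of what that looks like in practice; the grid table and the inserted values are made up for illustration:

    CREATE TABLE grid (cells integer[][]);  -- declared as 2-D

    -- All of these are accepted despite the 2-D declaration:
    INSERT INTO grid VALUES (ARRAY[1, 2, 3]);                   -- 1-D
    INSERT INTO grid VALUES (ARRAY[[1, 2], [3, 4]]);            -- 2-D
    INSERT INTO grid VALUES ('{{{1,2},{3,4}},{{5,6},{7,8}}}');  -- 3-D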
That surprised me. I wonder why PostgreSQL is designed that way - I normally expect it to be more strict than that.
The documentation seems to hint that it's an implementation limitation that may be addressed later:
However, the current implementation ignores any supplied array size limits, i.e., the behavior is the same as for arrays of unspecified length.
The current implementation does not enforce the declared number of dimensions either. Arrays of a particular element type are all considered to be of the same type, regardless of size or number of dimensions. So, declaring the array size or number of dimensions in CREATE TABLE is simply documentation; it does not affect run-time behavior.
(emphasis mine)
It depends on your use case, but arrays can be a very significant performance benefit, simply because a link table requires not just extra columns (a foreign key at a minimum, plus an ordinality column to match the ordering arrays give you for free), but also a pkey index and possibly an fkey index, depending on how you structure the table. If the linked table is sparse, you can see substantial read amplification relative to arrays.
I’ve used arrays extensively over the years and there are definite performance advantages, despite not being “relational” in a dogmatic sense. They aren’t for every use case and should be deployed with care, but link tables can suffer from the N+1 problem, and arrays can help you avoid it.
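For concreteness, a rough sketch of the two shapes being compared; the posts/tags schema is invented here purely for illustration:

    CREATE TABLE tags  (id bigint PRIMARY KEY, name text);
    CREATE TABLE posts (id bigint PRIMARY KEY, title text);

    -- Link-table shape: one row per association, plus the indexes needed to use it.
    CREATE TABLE post_tags (
        post_id  bigint  NOT NULL REFERENCES posts(id),
        position integer NOT NULL,               -- ordinality column
        tag_id   bigint  NOT NULL REFERENCES tags(id),
        PRIMARY KEY (post_id, position)
    );
    CREATE INDEX ON post_tags (tag_id);          -- the optional fkey-side index

    -- Array shape: the same association stored inline in the referencing row.
    ALTER TABLE posts ADD COLUMN tag_ids bigint[];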
Absolutely. In one early project where I started to use Postgres for storing simulation results, I ran into disk space issues. Instead of storing rows of essentially (simulation_nr, time, variable, value) I stored (simulation_nr, variable, values[]). Since it was a simulation, time here was simply a monotonically increasing integer, which became the position in the array.
The row overhead (24 bytes, IIRC) of storing individual values was quite high, and then there was the timestamp in each row. Of course, querying wasn't very fast either. With arrays, the queries flew and fetching all the measurements for a given variable to display in a plot was almost instantaneous because they were stored neatly together.
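Roughly the reshaping described above, sketched with guessed column types (the comment only names the columns):

    -- Before: one row per sample; each row pays the ~24-byte tuple header plus the time column.
    CREATE TABLE results_long (
        simulation_nr integer,
        time          integer,            -- monotonically increasing step counter
        variable      text,
        value         double precision
    );

    -- After: one row per (simulation, variable); the array position encodes the time step.
    CREATE TABLE results_packed (
        simulation_nr integer,
        variable      text,
        "values"      double precision[]  -- quoted because VALUES is a reserved word
    );

    -- Fetching a whole series for plotting reads a single row:
    SELECT "values" FROM results_packed WHERE simulation_nr = 1 AND variable = 'x';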
Yes, but also keep in mind that if your arrays are rather large, they will be TOAST-ed and stored across multiple disk pages; that probably nullifies any performance gain over a plain old link table.
But as long as your arrays mostly hold a few items and the whole tuple still fits in a single Postgres page, they're quite amazing, especially combined with the fact that you can GIN-index them!
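A small sketch of the GIN combination mentioned; the docs table and tag values are illustrative:

    CREATE TABLE docs (id bigint PRIMARY KEY, tag_ids integer[]);
    CREATE INDEX docs_tag_ids_gin ON docs USING gin (tag_ids);

    -- Containment / overlap queries on the array can then use the index:
    SELECT id FROM docs WHERE tag_ids @> ARRAY[7];      -- rows whose array contains 7
    SELECT id FROM docs WHERE tag_ids && ARRAY[7, 11];  -- rows overlapping either value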
PostgreSQL 14 introduced LZ4 as an alternative TOAST compression method (alongside the default pglz).
Postgres 16 introduced ZSTD, which is even better. (Though LZ4 might still be a better choice for smaller TOAST columns.)
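For reference, a minimal sketch of how the LZ4 option from 14 is selected; I'm only showing the LZ4 knobs, and the table and column names are illustrative:

    -- Per-column compression at table creation time (PostgreSQL 14+):
    CREATE TABLE measurements (series_id bigint, samples double precision[] COMPRESSION lz4);

    -- Switching an existing column; only newly written values use the new method:
    ALTER TABLE measurements ALTER COLUMN samples SET COMPRESSION lz4;

    -- Default for newly created columns in this session (or set it in postgresql.conf):
    SET default_toast_compression = 'lz4';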
Great overview of all the pitfalls of using arrays in Postgres (something that I found very tempting at times, but that always bit me at some point).
I am surprised, though, by the mention of 2024 at the beginning of the section about vectors in a blog post that seems to have been published yesterday… is it a rerun? :)
Text flow on that page is totally broken on mobile, I presume because the code blocks don’t have a fixed width and scroll bars. Too bad.
Whether you use a distinct integer[] type or a JSON list [1, 2, 3], you are making the exact same architectural decision: you are prioritising locality over normalisation.
A great way to put it!