Subtypes and status-dependent data: pure relational approach
12 points by squadette
12 points by squadette
Very clever strategy, thanks for a clear writeup. I think I found two resources that may be what you're looking for?
This strategy appears in Joe Celko's 2012 book Trees and Hierarchies in Sql for Smarties chapter 10.2, starting page 205:
Another version of the same approach [to subclassing] uses a two-part key in the subclass tables where one column is a constant that tells you what the table contains. Let's use abbreviation codes for “in state”, “out of state”, and “foreign” students.
It also appears in his 2008 Thinking in Sets section 6.4.1 page 100. There's an equivalent example, but the explanation is a single sentence:
I then use a compound key (gtin, media_type) and a constraint in each subclass table to ensure that the media_type is locked and agrees with the Publications table.
I read these books years ago but definitely didn't remember this design. At first glance it looks like an excellent tool for modeling sum types in SQL databases.
Also, I read the linked Subtype.pdf and I don't see the support for part of the design. On p3, there are two examples of subtyping and both say that a record must have a subtype. Both have an example transactions of inserting a record with a subtype, but I don't see that either enforces that a subtype record is created. Maybe there's a sql function I'm not aware (often the case!) but I expect a query could insert directly to the parent table and not insert to the subtype table, violating the design goal of requiring a subtype.
Like, in your blog's example, you could insert into images... with status set to invalid without inserting into invalid_images. This is expected and seems part of the design for pending and completed (presumably to indicate no animals were detected), but looks like a bug for invalid, unless you meant to use the absence of an error to indicate the error was unknown, or triggered an error in the error-handling code.
I don't think anyone would intentionally do this, and the first Celko example is a better demonstration because the full text is clear that it expects the subtypes to be total, but the goal of this modeling seems to be to 'make illegal states unrepresentable' using standard SQL. So either I'm missing something in SQL or this strategy has a gap that it can't require that a parent record be subtyped.
but I don't see that either enforces that a subtype record is created.
Yes. I think that in this design it's impossible to statically ensure that for invalid images there is always an error message. It is possible if you put error_message into the main table (with obvious CHECKs), of course.
The design is impossible because you cannot have circular FK references. I mean you can have but generally a validity of constraints is checked after each statement, and not at the end of transaction, so you cannot insert into two tables.
But the other part: a list is even more interesting. Last year I announced a relational modeling challenge that accidentally required a non-empty list. I received a truly ingeniuos reply that actually solves this problem: https://minimalmodeling.substack.com/p/relational-modeling-challenge/comment/176834259 It's like Lisp-style lists I think. I'm going to clean it up and discuss in February because I am amazed.
generally a validity of constraints is checked after each statement, and not at the end of transaction, so you cannot insert into two tables.
I'd imagine support would vary by database, but at least in Postgres and SQLite, it looks like it's possible to defer the enforcement of foreign keys by marking specific foreign key constraints as DEFERRABLE INITIALLY DEFERRED in the schema, or else by deferring enforcement on a per transaction basis (postgres, sqlite).
Fun list definition to another puzzle, thanks.
You make a good point that requiring a subtype is isomorphic to circular FKs. Which I guess means that we could use the standard solution there, too: create a new table to hold the semantic concept of "image" that holds the image_id other tables want to FK to. The new table has FKs pointing at image_parent_id and image_subtype_id, and also a required subtype. This is unattractive because you can't just say not null on the two FKs, you'd have to write a check like image_parent_id && (status = 'pending' || (status == 'invalid' && image_subtype_id) || status = 'completed'). Unfortunately this would open the possibility of orphan records in image_parent and image_subtype if those records were created without an image, and in a practical sense you'd still need a way to communicate into the future that other tables must not join to image_parent or image_subtype but the new top-level image. This would close off the possibility of a missing subtype but is not really a satisfying approach given the complexity and new problems.
yeah, relational model is disappointingly weak. My pet theory is that relational theory was developed at uniquely unfortunate time when programming language theory was itself not developed enough. If it were designed when say algebraic types were more widely understood there is a chance that we'd have something more powerful yet not much more complicated.
Thank you so much! I'll add both references to the list of literature.
I think I'll change the term from "constant" to immutable in the text, I think it better reflects what's going on. Because 'constant' may mean a constant value, but immutable means that the column cannot be changed.
I like "locked and agrees" also.