What Category Theory Teaches Us About DataFrames
39 points by knl
It’s missing a lot before it counts as category theory. You need to find the objects and morphisms, show that they obey the composition laws, and see what that leads to. As it stands, it looks more like a type theory. Things get more interesting when you try to model the hom sets here; say, for example, your hom set is a set of data migrations. Once you start working out the commutative diagrams and finding what they restrict the morphisms to, that’s when you find a subset of whatever you are modeling that has something special arising only from the composition laws.
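To make "obey certain composition laws" concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that objects are tables (lists of dicts) and morphisms are table-to-table functions; the helper names `compose`, `identity`, `drop_nulls`, `double_amount`, and `only_north` are hypothetical and not taken from the article. It only spot-checks associativity and the identity laws on one input, which is evidence rather than a proof.

```python
def compose(g, f):
    """Return the morphism 'f, then g' (g after f)."""
    return lambda table: g(f(table))

def identity(table):
    """Identity morphism: returns the table unchanged."""
    return table

# Three hypothetical table-to-table transformations.
def drop_nulls(table):
    return [row for row in table if row["amount"] is not None]

def double_amount(table):
    return [{**row, "amount": row["amount"] * 2} for row in table]

def only_north(table):
    return [row for row in table if row["region"] == "north"]

table = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": None},
    {"region": "north", "amount": 7},
]

# Associativity: h . (g . f) agrees with (h . g) . f on this input.
left = compose(only_north, compose(double_amount, drop_nulls))
right = compose(compose(only_north, double_amount), drop_nulls)
assert left(table) == right(table)

# Identity laws: composing with identity on either side changes nothing.
assert compose(identity, drop_nulls)(table) == drop_nulls(table)
assert compose(drop_nulls, identity)(table) == drop_nulls(table)
```

Function composition satisfies these laws automatically, so the check is trivially true here. The interesting part, as the comment says, is what extra structure you get once you restrict which morphisms are allowed.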
Interesting. The article is mostly about the relational algebra subset of dataframe features, and has not yet gone far enough to describe dataframe-specific features categorically.
There’s a point where they break down an aggregation op into the composition of a collect and a map, where the cells in the intermediate table have a row type. I suspect that if heterogeneous row types are a thing then more of the dataframe ops could be based on relational primitives.
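The collect-then-map decomposition can be sketched in plain Python, under some loud assumptions: tables are lists of dicts, the function names `collect`, `map_column`, and `aggregate` are made up for this example, and the nested cell holds a plain list of values rather than full rows. This is not the article's formulation, just one way to see the intermediate table.

```python
from collections import defaultdict

def collect(rows, key, value):
    """Group rows by `key`; each output cell under `value` holds the list of grouped values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, value: vs} for k, vs in groups.items()]

def map_column(rows, column, fn):
    """Apply `fn` to one column of every row, leaving other columns as they are."""
    return [{**row, column: fn(row[column])} for row in rows]

def aggregate(rows, key, value, fn):
    """Aggregation expressed as the composition: collect, then map."""
    return map_column(collect(rows, key, value), value, fn)

sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
]

# Intermediate table: cells in "amount" are nested collections.
nested = collect(sales, "region", "amount")
# Final table: the nested cells are reduced by an ordinary map.
totals = aggregate(sales, "region", "amount", sum)
```

The intermediate table `nested` is the one whose cells are not scalars, which is where a relational model without nested or heterogeneous cell types would have to stop.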
One thing I wondered about is why map is described as a dataframe feature not a relational feature. The link to Petersohn et al. is misdirected but I found a copy at https://arxiv.org/abs/2001.00888. The reason they give is (what I would call) a syntactic quibble about being able to treat a row as a single object or not.
A more fundamental difference is that dataframes are much more dynamically typed than relational tables. I wonder how useful it would be to distinguish the features that are useful for ingesting scruffy data (dynamic) from core analytics (static) and if it’s possible to do so and still fuse both stages into a single pass over big data.
Where's the category theory here?