Lobsters: Datalog in Rust

One of the big improvements for Datalog-shaped program analysis is supporting semilattice types, ime. egglog and ascent have them, and even Souffle has them now, and they’re a natural fit for “iterate to fixpoint” style abstract interpretation that comes up a lot. Even things like the points-to analysis in the article I usually think of in terms of saturating the points-to set instead of adding tuples over free variables, although they can obviously express the same thing.

The demand driven part is neat, and something that I never actually looked into in Datalog engines before! Practically every post I’ve seen previously have only talked about how to more quickly saturate the dataset, not how to more efficiently compute only the queried rows. I’ve used CodeQL in the past, which compiles down to Souffle, and had a lot of issues with its query planner ballooning intermediate table results computing facts that aren’t actually relevant, or that should be pared down by later negative rules, and having a more intelligent Datalog engine I think would be a lot better for program analysis than a faster one. CodeQL has the Klee star operator as a first-class language primitive which is cute, but I’ve never even seen the “transpose” operator before!

I’m also curious why no Datalog engines ever talk about doing perfect optimal multijoins as part of their query planning. Most rules have conjunctive queries that result in joins and semijoins (or, in this case, equijoins) over multiple tables and something like Yannakaki’s algorithm seems like it would be better in an algorithmic complexity case, especially if your query plan is able to re-use the intermediate join tables across multiple rules, or do the incremental updating thing using them for semi-naive evaluation.

One of the big improvements for Datalog-shaped program analysis is supporting semilattice types …

Yup. They seem pretty easy, but I’m probably misunderstanding how they work. Differential dataflow already supports semigroups as a “value”; e.g. each update is key -> val and usually the key is “row” and the value is “integer” with “+”, but it can be any type with associative op. When you add a Mul implementation to the semigroup it gains an interpretation in recursive computations with joins. I’m more into the non-monotonic computations though, e.g. supporting input retractions, which fouls up a bunch of the lattice-stuff, would be my guess.

The demand driven part is neat, and something that I never actually looked into in Datalog engines before!

Yeah it’s great. If I had to guess, it represents the difference between “Datalog engine that exists to compete in academic benchmarks” and “Datalog engine intended for use by humans or as an API”, where the latter clearly need support for the demand transform.

I’m also curious why no Datalog engines ever talk about doing perfect optimal multijoins as part of their query planning.

They do! At least, here’s me talking about them, and their implementation in datafrog. They are really quite important in Datalog as you end up with a lot more cyclic rules than in conventional SQL. Subfields like motif finding (and e.g. e-matching in egg) can benefit a lot from these, and there (imo) quite intuitive once you get your head around the attribute-at-a-time thing. The Yannakakis algorithm only applies for acyclic queries, afaiu, which may explain why it doesn’t show up as much as we’d all like.

optimal multijoins

they implemented one for egg/egglog, where it seems to have worked well. and the same people wrote this paper also

There’s discussion of optimal worst case join in this 2016 Frank post on differential Datalog: https://github.com/frankmcsherry/blog/blob/master/posts/2016-06-21.md#future-directions

Some prior work, in Datalog & Rust, that unfortunately is no longer maintained: https://github.com/mozilla/mentat

I have to be missing something obvious but the body of tri() doesn’t appear to form a 3-cycle?

I read it as graph edge()s being usually undirected / bidirectional, as usual for a graph? I agree that the motivating example of control flow in a program isn’t undirected, but…