Tools built on tree-sitter's concrete syntax trees
58 points by shapr
58 points by shapr
I’ve been thinking for a while about how great it would be to have code reviews with tree-sitter-based (or some other tech, I’m not picky) tooling - for things like being able to highlight pure-rename changes differently than actual behavior changes, or more intelligent “mark this code as reviewed” options.
Our bachelor project was exactly this though none of the reviewers were impressed by the idea at the time.
Do you have a link to your bachelor’s project? Was there a write-up?
I checked it and it’s private. I would need to ask to my friends to make it public. Basically it parses your Java source code and compares two ASTs at the repository level. All classes are compared between two versions. Then new methods shown as green, removed as red and updated as yellow. It doesn’t do AST diff at the function body level. Our focus was on the high level codebase structure. Function reorderings produce no diff at all for example.
But it’s really unfinished since we stopped working on it after the presentation. To make it usable in real life it needs a lot of work.
I feel the same way about everything I’ve written. Everything is unfinished and needs a lot of work.
I remember reading articles about this approach 15+ years ago, primarily related to eclipse/netbeans. It was called Code in Database back then. But those approaches were very tied to the IDE. From what I remember, there was little talk of how this would integrate with version control. Also this was back when Git existed but wasn’t the default. My memory is a bit hazy.
There wasn’t an ecosystem of tools built around open standards or common libraries like treesitter (and tangentially related LSP, and the ruff python parser). This is a great time to be a programmer.
You might look at https://difftastic.wilfred.me.uk/
It’s just the diff side of this, but might be a start.
And this is an interesting writeup on some approaches and research https://github.com/tobymao/sqlglot/blob/main/posts/sql_diff.md
Edit: Hah, serves me right for reading comments before the article :D
I worked on this exact problem for some time but didn’t manage to get something fully functional out of it in the time I had, so I shelved it. My proudest moment, though, was when I came up with its logo:
diff
dualTreesitter tooling is still very much an underexplored problem space, but since neovim supports treesitter natively, there are a couple of very useful plugins around already.
For example treesitter-textobjects enables editing directly based on treesitter nodes, e.g. swapping the order of function arguments and similar actions can now be made a single keystroke.
There is aerial.nvim for treesitter-based high-level navigation and folding, and you can even overwrite movement to navigate based on nodes instead of the line grid. Sometimes regex based search and replace is not powerful enough, so treesitter based structural search (similar to ast-grep mentioned in the article) might help out.
There is this site listing a bunch of treesitter plugins, but ultimately it feels like there is still lots of untapped potential.
It’s easy to get pigeonholed into thinking you have to write a tree-sitter tool that is general enough to work for every language with a tree-sitter grammar, but actually I find it’s nicer to use them to build language tooling for a single language, because the tree-sitter API is so nice and bindings exist for multiple languages (and grammars for many more). Often, the marquee parser for a language isn’t designed with external developer experience in mind, but tree-sitter is. For example, the tree-sitter-tlaplus grammar is used for:
i also posted about tbsp recently, it’s a DSL that abstracts visitors over tree-sitter trees.
i am presently hacking on code-navigation for tangled.sh that makes a best effort reference-definition relationship. it makes use of tree-sitter-graph internally, which is a wonderful DSL extend tree-sitter trees.
i am presently hacking on code-navigation for tangled.sh that makes a best effort reference-definition relationship. it makes use of tree-sitter-graph internally, which is a wonderful DSL extend tree-sitter trees.
Very cool! Are you using stack graphs, or are you building something custom? (Stack graphs has withered on the vine, unfortunately, since GitHub unshipped Precise Code Nav. But we had purposefully pulled TSG out as a separate project since “graph rewriting” on a TS syntax tree seemed more likely to be useful long-term.)
very happy to see you on this thread! i did try stack-graphs for code-nav (4 years ago maybe?) at a previous workplace, and my read on project was that it was too difficult to add support for a new language or improve existing stack-graph definitions (the typescript definitions are 6k lines of DSL code!). i suppose that is to be expected from the level of precision that stack-graphs attempt to get, but it requires the implementors to fully spec out language semantics in the TSG DSL.
the approach i have since landed on is much less precise, but still uses the same concepts you outline in your talk: the split between index time computation & query time computation is nearly identical (graphs are built per file). the TSG DSL is simply used to annotate the tree-sitter nodes with certain features: definition
, reference
and scope
.
so a file like so:
// foo.rs
fn add(x: i64, y: i64) -> i64 {
let sum = x + y;
sum
}
fn triple(x: i64) -> i64 {
let double = add(x, x);
let triple = add(double, x);
triple
}
would have a graph like so:
foo.scope {
definition add
definition triple
add.scope {
definition x
definition y
definition sum
reference x
reference y
reference sum
}
triple.scope {
definition x
definition double
definition triple
reference x
reference add
reference double
reference triple
}
}
it works pretty well for local variables and single-file navigation. in addition to resolving definitions via these graphs, i intend to perform a ctags-like text-based search for tokens.
a sample for a single rust file’s graph is here (needs plenty of work! closures result in incorrect navigation). unannotated tokens will still be clickable to perform across-the-project resolution eventually, with the aforementioned ctags-like fallback.
its unfortunate that GH pulled support for precise code-nav, but im glad projects like TSG and stack-graphs still continue to exist! most of the tbsp DSL is also based on the TSG evaluator. also super grateful for the work you and the GH team have done for tree-sitter, thanks!
it was too difficult to add support for a new language or improve existing stack-graph definitions (the typescript definitions are 6k lines of DSL code!). i suppose that is to be expected from the level of precision that stack-graphs attempt to get, but it requires the implementors to fully spec out language semantics in the TSG DSL.
Yep that tracks! These days I think that TSG was the wrong choice for stack graphs. The fact that it’s declarative and “emergent” means that it’s much harder to understand whether/how the stack graph rules correctly implement the language semantics. In retrospect I think an imperative Lua tree-walk / visitor-style interface would have been easier to work with.
the TSG DSL is simply used to annotate the tree-sitter nodes with certain features: definition, reference and scope
This looks great! It looks like it fits in as a nice middle ground between stack graphs and the simple textual symbol index that we did for Fuzzy Code Nav. (That just used tree-sitter queries, and so it didn’t really extract the scoping structure like your TSG solution does.)
I’m very curious to know what you will think about BABLR, which replaces the Tree-sitter system of declarative grammars implemented with code generation with a system of imperative parsers written in plain JS generator functions: https://github.com/bablr-lang/language-en-es3/blob/3bb8c6d6fa338e4c70a4b8e81793af78081d0183/lib/regex.macro.js
I had not seen that before, thanks for the link! I see a few possible comparison points here relative to tree-sitter that you might be thinking of.
Declarative vs imperative style for writing the grammar: This is an interesting one, which tbh probably is largely an aesthetic question — that is, has tree-sitter’s declarative style been a hindrance, either in terms of capability or ease of use? I honestly don’t have a good intuition for that!
Language ecosystem: It looks like the downstream tools that are using BABLR grammars are also written in JS, is that right? From a cursory look, it looks like there are plans to compile the parsers down to a VM, which you could implement in many languages. Is that right? Technically that’s true for tree-sitter too (modulo C lexers) though no one has ever invested in trying to implement the parsing engine in any other language, probably because there’s a good enough story for using a C library in most other languages. (I worry it would be too limiting if every BABLR tool has to embed a JS engine.)
The resulting syntax tree: It looks like you’re trying to innovate on the level of detail that’s included in the resulting syntax trees. (tree-sitter’s aren’t really pure ASTs or CSTs) Is that something that can’t be done in tree-sitter, or is it that grammar authors just haven’t included that yet? (The holes in particular are very interesting, and I’m trying to figure out whether you could get that from a tree-sitter parser)
Apologies if I’m missing any of the intent behind the project, this is what came to mind while looking through it for the first time.
You have very sharp eyes, as I thought you would : )
In my time in industry I was working deep on web frontend, so I took most of the ideas from web browsers. The idea of CSTML is to be to IDEs what HTML is to web browsers. We figure if we have a standard serialization format for trees, there can exist many different concrete embedding systems – essentially many different memory layouts unified behind a textual abstraction. So yes, I hope and expect that implementations of our trees and APIs will spring up in other languages like C++, Rust, and Python. The framework has some pretty exciting possibilities because it lends itself naturally to translating itself into other languages.
We’re definitely trying to innovate on the level of detail in the trees. Tree-sitter trees are designed to work in reference to a text document, but our trees are designed to be the document. Tree-sitter trees can be made to contain all CST-relevant data, but it requires extra work. CSTML trees are always guaranteed to have the source code they were parsed from as a kind of “inner text”. That makes them easy to handle: you don’t need any special knowledge to be able to render the embedded structure as flat syntax. In fact like with HTML the tag tree is meant to be invisible – a behind-the-scenes way to embed metadata into textual code for richer rendering.
Tree-sitter has some concept of gaps in trees, but it assumes the textual layer has no gaps, so you’d be stuck having to rewrite the core, the parsers, and the consuming tools to get support, which in my mind seems unlikely for now.
I’ve been thinking the golden age of tree-sitter-based tools is just around the corner for long enough to know I’m not much of a prophet, but I did build a couple of things myself:
alter table
.https://github.com/BrianHicks/tree-grepper is another one, I’ve used this to find some syntactic pattern that I want to rewrite (so sort of like using it as a linter). Worked great!
Neovim has such good treesitter support which has resulted in more reliable text objects and allows more precise manipulation without resorting to regexes etc. which is such a relief. If you’re a neovim user and don’t use treesitter yet, watch this neovimconf talk for an “automation” example.
this just reminded me of cursorless and how amazing it is. It seems too hard for me to use voice commands to edit code but i’m sure others could adapt new workflows.
I am a fan of Topiary forcode formatting. This single formatter can provide configuration in a way that most formatters don’t: tab indentation for accessibility.
I’d love to use combobulate
to get the same structural editing for C# and similar that I’m used to having for Lisp, but at work, where I use C#, I also use Windows, and it is functionally impossible to get tree-sitter grammars working with Emacs on Windows.
Semi-on-topic: Does anybody know of a tool that has VS Code support and allows me to drag-drop functions to reorder them within the file?
I’ve tried a bunch of sitter based search engines and they seem very clunky to me. Very simple queries just don’t return all expected results. I realize that this is because the tree doesn’t match how I read code, but query “foo()” not matching code “bar.foo()” is a no-go for me. So close yet so far.