syntax highlighting with tree-sitter
25 points by fanf
+1 to writing your own SSG for a personal blog; in my experience it is way easier to maintain.
You might want to add something like this to your style:
Right now, on mobile, an overflowing code block generates a horizontal scroll bar for the entire website, not just for the code block itself.
I am somewhat skeptical about using TreeSitter for syntax highlighting in many contexts: for high quality stuff, you need access to language semantics, for something that just looks good, a bunch of regexes would be way simpler.
> I am somewhat skeptical about using TreeSitter for syntax highlighting in many contexts: for high quality stuff, you need access to language semantics, for something that just looks good, a bunch of regexes would be way simpler.
Interesting. What approach would you recommend for all the high-quality stuff, including language semantics? (I had thought to start with tree-sitter and translate its output into my own IR, and the fact that tree-sitter is incremental is attractive.)
Ask the compiler in general! Probably via LSP semantic highlighting these days. That way, you’ll get non-syntactic information as well:
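For a sense of what "asking the compiler" can look like on the wire, here is a rough sketch of decoding the flat integer array that an LSP server returns for `textDocument/semanticTokens/full`. The five-integers-per-token relative encoding comes from the LSP specification; the `Token` struct and `decode_semantic_tokens` function are names invented for this illustration, and a real client would also map `token_type` and `modifiers` through the legend the server advertises.

```rust
/// One decoded semantic token with an absolute position.
/// (Hypothetical type for this sketch, not part of any LSP library.)
#[derive(Debug, PartialEq, Eq)]
pub struct Token {
    pub line: u32,
    pub start: u32,
    pub length: u32,
    /// Index into the server's `legend.tokenTypes`.
    pub token_type: u32,
    /// Bit set whose bits index into the server's `legend.tokenModifiers`.
    pub modifiers: u32,
}

/// Decode the `data` array of a semantic tokens response.
/// Each token is five integers: deltaLine, deltaStartChar, length,
/// tokenType, tokenModifiers. Trailing incomplete groups are ignored.
pub fn decode_semantic_tokens(data: &[u32]) -> Vec<Token> {
    let mut tokens = Vec::with_capacity(data.len() / 5);
    let (mut line, mut start) = (0u32, 0u32);
    for chunk in data.chunks_exact(5) {
        let (delta_line, delta_start) = (chunk[0], chunk[1]);
        line += delta_line;
        // The start delta is relative to the previous token only when
        // both tokens are on the same line; otherwise it is absolute.
        start = if delta_line == 0 { start + delta_start } else { delta_start };
        tokens.push(Token {
            line,
            start,
            length: chunk[2],
            token_type: chunk[3],
            modifiers: chunk[4],
        });
    }
    tokens
}
```

The token types here (for example "this identifier is a parameter" or "this is a mutable static") are decided by the language server, which is where the non-syntactic information comes from.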
Yes, indeed!
I notice my question was poorly worded. If I am writing a compiler, would you consider tree-sitter a poor fit for the first stage of the front end? I was attracted to having a specified grammar, being able to reuse the tree-sitter grammar in other situations (zed, etc), and the incrementality (though I realize I would need to design the following stages to be incremental to benefit from that), but maybe it’s a bad approach?
I’d use tree sitter in a compiler only if:
Otherwise, I’d follow a more traditional approach: hand-write the parser (preferable), or use some LR generator.
The main thing you lose with tree sitter is grammar. It is GLR so you won’t know which language you are actually parsing (though, I think I can be convinced that giving an ambiguity error at runtime is the right choice).
Other than that, I think you’ll incur a complexity cost, relative to a hand-written parser (not sure about complexity vs yacc & the gang).
You don’t need incrementality. You’ll likely need error resilience, but that’s not hard to do yourself.
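As a rough illustration of the "not hard to do yourself" claim, here is a minimal sketch of one common approach, panic-mode recovery, in a hand-written parser. The toy grammar (`let NAME = NUMBER ;`, with whitespace-separated tokens) and the names `Let`, `parse` and `parse_let` are all assumptions made up for this example; a real front end would have a proper lexer and source positions.

```rust
/// A parsed `let NAME = NUMBER ;` statement (toy grammar for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct Let {
    pub name: String,
    pub value: i64,
}

/// Parse whitespace-separated tokens, collecting errors instead of
/// stopping at the first one.
pub fn parse(source: &str) -> (Vec<Let>, Vec<String>) {
    let tokens: Vec<&str> = source.split_whitespace().collect();
    let (mut stmts, mut errors) = (Vec::new(), Vec::new());
    let mut pos = 0;
    while pos < tokens.len() {
        match parse_let(&tokens, pos) {
            Ok((stmt, next)) => {
                stmts.push(stmt);
                pos = next;
            }
            Err(message) => {
                errors.push(message);
                // Panic-mode recovery: skip to just past the next `;`
                // and resume parsing from there.
                while pos < tokens.len() && tokens[pos] != ";" {
                    pos += 1;
                }
                pos += 1;
            }
        }
    }
    (stmts, errors)
}

/// Parse one statement starting at `pos`; on success return it together
/// with the index of the token that follows it.
fn parse_let(tokens: &[&str], pos: usize) -> Result<(Let, usize), String> {
    // Look at the token `i` places after `pos`, with a placeholder at the end.
    let at = |i: usize| tokens.get(pos + i).copied().unwrap_or("<end of input>");
    if at(0) != "let" {
        return Err(format!("token {}: expected `let`, found `{}`", pos, at(0)));
    }
    let name = at(1);
    if !name.chars().all(|c| c.is_ascii_alphabetic() || c == '_') {
        return Err(format!("token {}: expected a name, found `{}`", pos + 1, name));
    }
    if at(2) != "=" {
        return Err(format!("token {}: expected `=`, found `{}`", pos + 2, at(2)));
    }
    let value = at(3)
        .parse::<i64>()
        .map_err(|_| format!("token {}: expected a number, found `{}`", pos + 3, at(3)))?;
    if at(4) != ";" {
        return Err(format!("token {}: expected `;`, found `{}`", pos + 4, at(4)));
    }
    Ok((Let { name: name.to_string(), value }, pos + 5))
}
```

Calling `parse("let a = 1 ; let = 2 ; let c = 3 ;")` reports one error for the middle statement and still returns the statements for `a` and `c`, which is the property an editor or highlighter cares about.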
As well as @matklad’s answer, I like @ltratt’s advice https://tratt.net/laurie/blog/2023/why_we_need_to_know_lr_and_recursive_descent_parsing_techniques.html (41 comments on lobsters)
Short version is, it depends on whether you are implementing a parser for an existing grammar, implementing a parser for a language without an explicit grammar, or designing a grammar and parser for a new language.
The lack of scroll bars on code blocks is intentional, I prefer being able to zoom in and out instead.
Re. alternatives to tree-sitter, if you know of something with support for a good selection of languages that I can plug into a Rust static site generator in a few dozen lines of code, I’m all ears.
I used TreeSitter for Rego and Lua in the CHERIoT book (C/C++ were done with libclang). I was amazed at how absolutely terrible the APIs are. They make it really hard to extract semantic structure. You can use the highlighter to mark up regions with colours, but I don’t want that, I want CSS to do that by marking up regions based on semantics. There’s absolutely no attempt to build an ontology, every parser defines its own types (as strings). In one grammar all punctuation is a punctuation token type, in another each punctuation character is a distinct type that is the character name. There’s no kind of subtyping, so you can’t have something that is an identifier and a class name and a reference to another declaration, it’s just one of those (or, if you’re lucky, three tree nodes each more or less generic, but there’s absolutely no consistency between them). Trying to use TreeSitter to consistently highlight two languages is an exercise in frustration.
I’m not sure what the APIs are like, but I think this is what the queries system is for (Helix C example). The grammar builds a language-specific tree and the queries group things semantically and possibly in different ways based on whether you’re highlighting, indenting, navigating, etc.
Yeah, those kinds of problems became apparent to me from the second language.
> There’s no kind of subtyping, so you can’t have something that is an identifier and a class name and a reference to another declaration, it’s just one of those
I haven’t looked into the guts of the grammars yet, but the highlighting queries can do some simple subtyping, like the @type vs @type.builtin example in my article.
I also use syntect in my rust based ssg - it’s very easy to use with comrak.
https://git.sr.ht/~metasyn/memex/tree/main/item/src/main.rs#L314
Sorry, didn’t realize that whole-site scroll bar was intentional!
> if you know of something with support for a good selection of languages that I can plug into a Rust static site generator in a few dozen lines of code,
I remember looking into this a couple of years back (not sure for what purpose), and, at that time I came to the conclusion that no, every complete Rust option is very complicated :P
For my thing, I use server-side hljs, which is nice as I can easily extend it for bespoke languages: https://github.com/matklad/matklad.github.io/blob/6833f64d8d1ea81da90be90e206d7a2498d850ba/src/highlight.ts#L16
But, yeah, that’s very much not Rust!
I generally try to keep code snippets narrow enough that they work OK on narrow screens, but sometimes they sprawl. It’s a difficult trade-off; I’m clearly in the minority, though! Dunno why putting code into matchboxes is so popular, seems like style over substance to me :-P
By the way, re. your neat jsx revamp, you have a bug in your feed generator: it prints [object Object] 😱
> By the way, re. your neat jsx revamp, you have a bug in your feed generator: it prints [object Object] 😱
Lol, thanks! As I’ve said recently, I am not a fan of putting HTML into XML, as that’s hard to get right :P
> I generally try to keep code snippets narrow enough that they work OK on narrow screens, but sometimes they sprawl. It’s a difficult trade-off;
Yeah, my least-bad solution here is:
This person compared Treesitter and something called Syntect, and chose the latter:
https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tree-sitter_grammar/
Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.
https://news.ycombinator.com/item?id=39762495
(my summary of Treesitter feedback, which may or may not be related to the blog use case - https://news.ycombinator.com/item?id=39783471 )
I just recently added syntax highlighting for a refresh of my own blog and I ended up using syntect, mostly because that is what typst does which I use as my markdown alternative.
But the end result is around 5 lines that do the actual highlighting (and about 200 lines of rendering typst to html and rewriting the html to add syntax highlights).
Source code: https://git.sr.ht/~erk/hs/tree/main/item/postomatic/src/lib.rs#L109
How it looks: https://beta.erk.dev/2023/07/25/BadAppleFont.html
(the link changed to https://beta.erk.dev/2023/07/25/anifont.html but I cannot edit it by now)
In my wiki, I use syntect for code highlighting. It’s a solid library that I can recommend.
for my own website i wound up just writing my own custom command-line tool wrapping syntect bc i couldn’t find a tool i liked, and bc a command line tool is in theory generally useful and in practice really easy to tie into my site’s build process (it’s a static site made using php and a little gmake. slightly cursed but very fun and versatile)
the command-line tool in question: https://git.sr.ht/~alterae/htmlight
more or less entirely undocumented bc it’s an internal tool with a very simple interface
ironically my site is NOT built using the ssg i wrote in rust a few years ago
I agree with a lot of your gripes about tree-sitter, I use it for syntax highlighting on my own site too and have a love-hate relationship with it…
I thought the current tree-sitter crates were incompatible with each other!
This is a big pain point for me as my SSG uses about 10 different grammars, many not under the tree-sitter org. In practice, this means lots of vendoring as the various grammar crates haven’t yet updated to support newer versions of tree-sitter.
I was offended that tree-sitter-highlight seems to expect me to hardcode a list of highlight names, without explaining where they come from or what they mean.
Here’s my understanding of how tree-sitter grammars, query files, and tree-sitter-highlight all work with each other:
The highlights.scm file is coupled to a specific tree-sitter grammar, and defines highlight names for specific nodes in the syntax tree.
The list of names passed to tree-sitter-highlight is a subset of the highlight names in the highlights.scm file, tailored to only the syntax you are interested in actually highlighting.
For example, if you have no desire to highlight comments, you can omit "comment" from the list passed to .configure, which means your callback will not be called for comment highlights defined in the highlights.scm file.
This can be useful for reducing the amount of HTML generated especially if you only want minimal highlighting.
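To make that relationship concrete, here is a simplified stdlib-only model of how a list of recognized names could be matched against the capture names used in a highlights.scm file. It reflects my reading of how tree-sitter-highlight resolves names (a recognized name matches when all of its dot-separated parts appear in the capture name, and the match with the most parts wins); the function name `resolve_captures` is made up for this sketch, and the crate's source is the authority on the real behaviour.

```rust
/// For each capture name used by a highlights query, pick the index of the
/// best-matching recognized name, or `None` when nothing matches (in which
/// case that capture produces no highlight at all).
/// Simplified model for illustration; not the crate's actual code.
pub fn resolve_captures(capture_names: &[&str], recognized: &[&str]) -> Vec<Option<usize>> {
    capture_names
        .iter()
        .map(|capture| {
            let capture_parts: Vec<&str> = capture.split('.').collect();
            // Best candidate so far: (index into `recognized`, parts matched).
            let mut best: Option<(usize, usize)> = None;
            for (index, name) in recognized.iter().enumerate() {
                let parts: Vec<&str> = name.split('.').collect();
                // Every part of the recognized name must occur in the capture name.
                let matches = parts.iter().all(|part| capture_parts.contains(part));
                if matches && best.map_or(true, |(_, len)| parts.len() > len) {
                    best = Some((index, parts.len()));
                }
            }
            best.map(|(index, _)| index)
        })
        .collect()
}
```

With recognized names `["type", "type.builtin", "keyword"]`, the capture `type.builtin` resolves to the more specific entry, `keyword.control` falls back to `keyword`, and `comment` resolves to nothing, which is the "omit it from the list" case described above.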
I’m not entirely satisfied with the level of detail and consistency provided by the tree-sitter language grammars and highlight queries. For instance, in the CSS above the class names and property names have the same colour because the CSS highlights.scm gives them the same highlight name.
The quality of syntax highlighting you get from tree-sitter-highlight depends pretty much exclusively on highlights.scm. The default bundled highlights assign the same highlight name to class names and property names, so that’s what you get in the end.
In my blog implementation, I use nvim-treesitter’s queries, which I believe are higher quality than most grammar-bundled highlight queries. However, there are quirks you’ll have to address (which I handled with hacky automation) because of stuff like editor-specific query predicates (eg. lua-match in neovim).
Is this wrapper a standalone binary? I have been hoping to find a tree-sitter-grammar→HTML converter akin to highlight (cat file | highlight --syntax=syntax). I am still using Prism, but it’s hard to tell the status of the project, & I have even considered piping to Neovim with the nvim-treesitter [sic] & having that spit out HTML from the syntax highlighting.
I use autumn[1] for my personal site. It is an Elixir library based on the author’s Rust/C[2] library, which uses tree-sitter for syntax highlighting.
That looks fairly nice!
Also it clarifies what I was wondering about why it seems hard to find themes for styling code on web pages. The ecosystem is (understandably) mostly interested in themes for editors such as Sublime Text or vim, and a highlighter typically decorates code with concrete colours. If there’s an indirection layer between syntactic categories and colours, it usually isn’t exposed in the highlighter’s output.
What I was hoping for (and what I implemented) is a highlighter that produces labelled output (spans with classes) that can be themed with CSS. In particular, that supports both light mode and dark mode without re-running the highlighter over the code.
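A minimal sketch of what "labelled output" can mean, assuming the highlighter has already produced sorted, non-overlapping byte ranges tagged with highlight names. The functions `escape_html` and `render_spans`, and the choice to turn dots in a name into hyphens for the class attribute, are illustrative assumptions rather than a description of the actual implementation.

```rust
use std::ops::Range;

/// Escape the characters that are special in HTML text and attribute values.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render `source` as HTML, wrapping each labelled byte range in a `<span>`
/// whose class is derived from the highlight name (dots become hyphens).
/// Ranges must be sorted, non-overlapping, and on character boundaries.
pub fn render_spans(source: &str, ranges: &[(Range<usize>, &str)]) -> String {
    let mut html = String::new();
    let mut cursor = 0;
    for (range, name) in ranges {
        // Unlabelled text between highlights is emitted without a span.
        html.push_str(&escape_html(&source[cursor..range.start]));
        html.push_str(&format!(
            "<span class=\"{}\">{}</span>",
            escape_html(&name.replace('.', "-")),
            escape_html(&source[range.clone()])
        ));
        cursor = range.end;
    }
    html.push_str(&escape_html(&source[cursor..]));
    html
}
```

Because the output carries only class names, a stylesheet can give `.keyword` one colour by default and another inside a `prefers-color-scheme: dark` media query, so switching themes never involves the highlighter.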
Microsoft VS Code seems to use numbered class names to indicate the syntactic category of each token, so even a web-based editor isn’t using straightforward CSS for its theme system; instead it compiles proprietary themes to CSS. Disappointing!
Nice, I’ve been meaning to write something like this for Hugo for quite a while. It would probably have to be a markdown preprocessing step since Hugo doesn’t seem to allow plugins by design (it wants to stay fast, is the design rationale I think).
That’s pretty nice. I like that the devtools then provides you some pretty-meaningful hints as to what’s going on in the code if you really need help because you don’t know the highlighted language