syntax highlighting with tree-sitter

26 points by fanf

matklad

+1 to writing your own SSG for personal blog, in my experience it is way easier to maintain.

You might want to add something like this to your style:

https://github.com/matklad/matklad.github.io/blob/6833f64d8d1ea81da90be90e206d7a2498d850ba/content/css/main.css#L101

Right now, on mobile, an overflowing code block generates a horizontal scroll bar for the entire website, not just for the code block itself.

I am somewhat skeptical about using TreeSitter for syntax highlighting in many contexts: for high quality stuff, you need access to language semantics, for something that just looks good, a bunch of regexes would be way simpler.

andyferris

I am somewhat skeptical about using TreeSitter for syntax highlighting in many contexts: for high quality stuff, you need access to language semantics, for something that just looks good, a bunch of regexes would be way simpler.

Interesting. What approach would you recommend for the high-quality stuff, including language semantics? (I had thought to start with tree-sitter and translate its output into my own IR, and the fact that tree-sitter is incremental is attractive.)
- matklad
  Ask the compiler in general! Probably via LSP semantic highlighting these days. That way, you’ll get non-syntactic information as well:
  
  Which bindings are mutable
  
  What kind of thing a name refers to (struct vs trait, local vs field, method vs extension)
  
  Shadowing disambiguation
  - andyferris
    
    Yes, indeed!
    
    I notice my question was poorly worded. If I am writing a compiler, would you consider tree-sitter a poor fit for the first stage of the front end? I was attracted to having a specified grammar, being able to reuse the tree-sitter grammar in other situations (zed, etc), and the incrementality (though I realize I would need to design the following stages to be incremental to benefit from that), but maybe it’s a bad approach?
    
    matklad
    
    I’d use tree sitter in a compiler only if:
    
    this is explicitly a quick experimental thing, where I don’t care about quality
    
    or if I need to support many languages.
    
    Otherwise, I’d follow a more traditional approach: hand-write the parser (preferable), or use some LR generator.
    
    The main thing you lose with tree sitter is grammar. It is GLR so you won’t know which language you are actually parsing (though, I think I can be convinced that giving ambiguity error at runtime is the right choice).
    
    Other than that, I think you’ll incur a complexity cost relative to a hand-written parser (not sure about complexity vs yacc & the gang).
    
    You don’t need incrementality. You’ll likely need error resilience, but that’s not hard to do yourself.
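
    To give a rough idea of what do-it-yourself error resilience can look like, here is a minimal sketch of panic-mode recovery in a hand-written parser. Everything in it is an assumption made for illustration: the toy grammar (`let NAME = NUMBER ;`), the `Let` type, and the helpers `lex`, `expect`, `parse_let` and `parse_program`. On a malformed statement the parser records an error, skips to just past the next `;`, and carries on.

    ```rust
    #[derive(Debug, PartialEq)]
    struct Let {
        name: String,
        value: i64,
    }

    /// Splits input into words (alphanumeric runs) and single punctuation characters.
    fn lex(input: &str) -> Vec<String> {
        let mut tokens = Vec::new();
        let mut word = String::new();
        for c in input.chars() {
            if c.is_alphanumeric() || c == '_' {
                word.push(c);
                continue;
            }
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            if !c.is_whitespace() {
                tokens.push(c.to_string());
            }
        }
        if !word.is_empty() {
            tokens.push(word);
        }
        tokens
    }

    /// Consumes the next token if `accept` approves it; on a mismatch it
    /// reports an error and leaves the token in place for recovery.
    fn expect<'a>(
        tokens: &'a [String],
        pos: &mut usize,
        what: &str,
        accept: fn(&str) -> bool,
    ) -> Result<&'a str, String> {
        match tokens.get(*pos) {
            Some(tok) if accept(tok) => {
                *pos += 1;
                Ok(tok.as_str())
            }
            Some(tok) => Err(format!("expected {}, found `{}`", what, tok)),
            None => Err(format!("expected {}, found end of input", what)),
        }
    }

    /// Parses one `let NAME = NUMBER ;` statement starting at `*pos`.
    fn parse_let(tokens: &[String], pos: &mut usize) -> Result<Let, String> {
        expect(tokens, pos, "`let`", |t| t == "let")?;
        let name = expect(tokens, pos, "a name", |t| {
            t.chars().all(|c| c.is_alphabetic() || c == '_')
        })?;
        expect(tokens, pos, "`=`", |t| t == "=")?;
        let value = expect(tokens, pos, "a number", |t| t.parse::<i64>().is_ok())?;
        expect(tokens, pos, "`;`", |t| t == ";")?;
        Ok(Let {
            name: name.to_string(),
            value: value.parse().map_err(|_| "number out of range".to_string())?,
        })
    }

    /// Parses every statement it can, collecting errors instead of stopping.
    fn parse_program(input: &str) -> (Vec<Let>, Vec<String>) {
        let tokens = lex(input);
        let (mut statements, mut errors) = (Vec::new(), Vec::new());
        let mut pos = 0;
        while pos < tokens.len() {
            match parse_let(&tokens, &mut pos) {
                Ok(statement) => statements.push(statement),
                Err(message) => {
                    errors.push(message);
                    // Recovery: skip to just past the next `;` and resume there.
                    while pos < tokens.len() && tokens[pos] != ";" {
                        pos += 1;
                    }
                    pos += 1;
                }
            }
        }
        (statements, errors)
    }

    fn main() {
        let (statements, errors) = parse_program("let a = 1; let = 2; let b = 3;");
        println!("{:?}", statements);
        println!("{:?}", errors);
    }
    ```

    The broken middle statement costs one error message, and the statements on either side of it still parse. A real parser would pick its synchronisation points per construct; `;` is just the simplest choice for this toy grammar.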
    
    andyferris
    
    Thank you
    
    fanf
    
    As well as @matklad’s answer, I like @ltratt’s advice https://tratt.net/laurie/blog/2023/why_we_need_to_know_lr_and_recursive_descent_parsing_techniques.html (41 comments on lobsters)
    
    Short version is, it depends on whether you are implementing a parser for an existing grammar, implementing a parser for a language without an explicit grammar, or designing a grammar and parser for a new language.
    
    andyferris
    
    That was a really helpful article, thank you!
    
    fanf
    
    The lack of scroll bars on code blocks is intentional, I prefer being able to zoom in and out instead.
    
    Re. alternatives to tree-sitter, if you know of something with support for a good selection of languages that I can plug into a Rust static site generator in a few dozen lines of code, I’m all ears.
    
    david_chisnall
    
    I used TreeSitter for Rego and Lua in the CHERIoT book (C/C++ were done with libclang). I was amazed at how absolutely terrible the APIs are. They make it really hard to extract semantic structure. You can use the highlighter to mark up regions with colours, but I don’t want that, I want CSS to do that by marking up regions based on semantics. There’s absolutely no attempt to build an ontology, every parser defines its own types (as strings). In one grammar all punctuation is a punctuation token type, in another each punctuation character is a distinct type that is the character name. There’s no kind of subtyping, so you can’t have something that is an identifier and a class name and a reference to another declaration, it’s just one of those (or, if you’re lucky, three tree nodes each more or less generic, but there’s absolutely no consistency between them). Trying to use TreeSitter to consistently highlight two languages is an exercise in frustration.
    
    jamesw
    
    I’m not sure what the APIs are like, but I think this is what the queries system is for (Helix C example). The grammar builds a language-specific tree and the queries group things semantically and possibly in different ways based on whether you’re highlighting, indenting, navigating, etc.
    
    fanf
    
    Yeah, those kinds of problems became apparent to me from the second language.
    
    There’s no kind of subtyping, so you can’t have something that is an identifier and a class name and a reference to another declaration, it’s just one of those
    
    I haven’t looked into the guts of the grammars yet, but the highlighting queries can do some simple subtyping, like the @type vs @type.builtin example in my article.
    
    metasyn
    
    I also use syntect in my rust based ssg - it’s very easy to use with comrak.
    
    https://git.sr.ht/~metasyn/memex/tree/main/item/src/main.rs#L314
    
    matklad
    
    Sorry, didn’t realize that whole-site scroll bar was intentional!
    
    if you know of something with support for a good selection of languages that I can plug into a Rust static site generator in a few dozen lines of code,
    
    I remember looking into this a couple of years back (not sure for what purpose), and, at that time I came to the conclusion that no, every complete Rust option is very complicated :P
    
    For my thing, I use server-side hljs, which is nice as I can easily extend it for bespoke languages: https://github.com/matklad/matklad.github.io/blob/6833f64d8d1ea81da90be90e206d7a2498d850ba/src/highlight.ts#L16
    
    But, yeah, that’s very much not Rust!
    
    fanf
    
    I generally try to keep code snippets narrow enough that they work OK on narrow screens, but sometimes they sprawl. It’s a difficult trade-off; I’m clearly in the minority, though! Dunno why putting code into matchboxes is so popular, seems like style over substance to me :-P
    
    By the way, re. your neat jsx revamp, you have a bug in your feed generator: it prints [object Object] 😱
    
    matklad
    
    By the way, re. your neat jsx revamp, you have a bug in your feed generator: it prints [object Object] 😱
    
    Lol, thanks! As I’ve said recently, I am not a fan of putting HTML into XML, as that’s hard to get right :P
    
    I generally try to keep code snippets narrow enough that they work OK on narrow screens, but sometimes they sprawl. It’s a difficult trade-off;
    
    Yeah, my least-bad solution here is:
    
    wrap prose at 60
    
    limit code at 80, so there’s guaranteed absence of scrollbars on a wide-enough (eg, ipad) screen, but keep it narrower if feasible
    
    for literal mobile view, accept that code isn’t going to be particularly readable anyway, and add scrollbars to code, so that the entire thing doesn’t have a horizontal scrollbar, which breaks vertical scrolling of non-code.
    
    andyc
    
    This person compared Treesitter and something called Syntect, and chose the latter:
    
    https://www.jonashietala.se/blog/2024/03/19/lets_create_a_tree-sitter_grammar/
    
    Because of these issues I’ll evaluate what highlighter to use on a case-by-case basis, with Syntect as the default choice.
    
    https://news.ycombinator.com/item?id=39762495
    
    (my summary of Treesitter feedback, which may or may not be related to the blog use case - https://news.ycombinator.com/item?id=39783471 )
    
    valdemar
    
    I just recently added syntax highlighting for a refresh of my own blog and I ended up using syntect, mostly because that is what typst does which I use as my markdown alternative.
    
    But the end result is around 5 lines that do the actual highlighting (and about 200 lines for rendering typst to html and rewriting the html to add syntax highlights).
    
    Source code: https://git.sr.ht/~erk/hs/tree/main/item/postomatic/src/lib.rs#L109
    
    How it looks: https://beta.erk.dev/2023/07/25/BadAppleFont.html
    
    valdemar
    
    (the link changed to https://beta.erk.dev/2023/07/25/anifont.html but I cannot edit it by now)
    
    BenjaminRi
    
    In my wiki, I use syntect for code highlighting. It’s a solid library that I can recommend.
    
    alterae
    
    for my own website i wound up just writing my own custom command-line tool wrapping syntect bc i couldn’t find a tool i liked, and bc a command line tool is in theory generally useful and in practice really easy to tie into my site’s build process (it’s a static site made using php and a little gmake. slightly cursed but very fun and versatile)
    
    the command-line tool in question: https://git.sr.ht/~alterae/htmlight
    
    more or less entirely undocumented bc it’s an internal tool with a very simple interface
    
    ironically my site is NOT built using the ssg i wrote in rust a few years ago
    
    kosayoda
    
    I agree with a lot of your gripes about tree-sitter, I use it for syntax highlighting on my own site too and have a love-hate relationship with it…
    
    I thought the current tree-sitter crates were incompatible with each other!
    
    This is a big pain point for me as my SSG uses about 10 different grammars, many not under the tree-sitter org. In practice, this means lots of vendoring as the various grammar crates haven’t yet updated to support newer versions of tree-sitter.
    
    I was offended that tree-sitter-highlight seems to expect me to hardcode a list of highlight names, without explaining where they come from or what they mean.
    
    Here’s my understanding on how all of tree-sitter grammars, query files, and tree-sitter-highlight work with each other:
    
    The tree-sitter grammar that generates the syntax tree decides the name of the nodes and their relationships in the tree.
    
    A highlights.scm file is coupled to a specific tree-sitter grammar, and defines highlight names for specific nodes in the syntax tree.
    
    The “list of recognized highlight names” passed to tree-sitter-highlight is a subset of the highlight names in the highlights.scm file, tailored to only the syntax you are interested in actually highlighting.
    
    For example, if you have no desire to highlight comments, you can omit "comment" from the list passed to .configure, which means your callback will not be called for comment highlights defined in the highlights.scm file. This can be useful for reducing the amount of HTML generated especially if you only want minimal highlighting.
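
    A minimal stdlib-only sketch of that matching step, as a simplified model and not the crate’s own code. `best_match` is a hypothetical helper, and the rule it assumes is this: a recognized name matches a capture name from highlights.scm when all of its dot-separated parts occur in the capture name, and the match with the most parts wins.

    ```rust
    /// Returns the index of the most specific recognized name for a capture,
    /// or None if no recognized name matches (so the capture is not highlighted).
    fn best_match(capture: &str, recognized: &[&str]) -> Option<usize> {
        let capture_parts: Vec<&str> = capture.split('.').collect();
        let mut best: Option<(usize, usize)> = None; // (index, number of parts)
        for (i, name) in recognized.iter().enumerate() {
            let parts: Vec<&str> = name.split('.').collect();
            let all_present = parts.iter().all(|p| capture_parts.contains(p));
            let is_better = match best {
                Some((_, len)) => parts.len() > len,
                None => true,
            };
            if all_present && is_better {
                best = Some((i, parts.len()));
            }
        }
        best.map(|(i, _)| i)
    }

    fn main() {
        // "comment" is deliberately left out of the recognized names.
        let recognized = ["keyword", "type", "type.builtin"];
        for capture in ["type", "type.builtin", "keyword.control", "comment"] {
            match best_match(capture, &recognized) {
                Some(i) => println!("@{} -> {}", capture, recognized[i]),
                None => println!("@{} -> (not highlighted)", capture),
            }
        }
    }
    ```

    Under this model `@type.builtin` resolves to `type.builtin`, `@keyword.control` falls back to `keyword`, and `@comment` matches nothing because it was omitted from the list.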
    
    I’m not entirely satisfied with the level of detail and consistency provided by the tree-sitter language grammars and highlight queries. For instance, in the CSS above the class names and property names have the same colour because the CSS highlights.scm gives them the same highlight name.
    
    The quality of syntax highlighting you get from tree-sitter-highlight depends pretty much exclusively on highlights.scm. The default bundled highlights assign the same highlight name to class names and property names, so that’s what you get in the end.
    
    In my blog implementation, I use nvim-treesitter’s queries, which I believe are higher quality than most grammar-bundled highlight queries. However, there are quirks you’ll have to address (that I accomplished with hacky automation) because of stuff like editor-specific query predicates (eg. lua-match in neovim).
    
    toastal
    
    Is this wrapper a standalone binary? I have been hoping to find a tree-sitter-grammar→HTML converter akin to highlight (cat file | highlight --syntax=syntax). I am still using Prism, but it’s hard to tell the status of the project, & I have even considered piping to Neovim with the nvim-treesitter [sic] & having that spit out HTML from the syntax highlighting.
    
    fanf
    
    It’s symbiosisware right now, tied into the guts of my web site. But it’s like one page of code, it would be straightforward to turn it into a program in its own right.
    
    toastal
    
    No pressure, but would be cool :)
    
    xo
    
    I use autumn[1], an Elixir library, for my personal site; it is based on the author’s Rust/C[2] library, which uses tree-sitter for syntax highlighting.
    
    [1] https://github.com/leandrocp/autumn
    
    [2] https://github.com/leandrocp/autumnus
    
    fanf
    
    That looks fairly nice!
    
    Also it clarifies what I was wondering about why it seems hard to find themes for styling code on web pages. The ecosystem is (understandably) mostly interested in themes for editors such as Sublime Text or vim, and a highlighter typically decorates code with concrete colours. If there’s an indirection layer between syntactic categories and colours, it usually isn’t exposed in the highlighter’s output.
    
    What I was hoping for (and what I implemented) is a highlighter that produces labelled output (spans with classes) that can be themed with CSS. In particular, that supports both light mode and dark mode without re-running the highlighter over the code.
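
    A minimal sketch of the labelled-output idea, not the code from the article. It assumes the highlighter has already produced sorted, non-overlapping byte ranges with highlight names; the `Region` type, the `render` and `escape_html` helpers, and the `hl-` class prefix are all hypothetical.

    ```rust
    /// One highlighted region: a byte range in the source plus a highlight name.
    struct Region<'a> {
        start: usize,
        end: usize,
        name: &'a str,
    }

    /// Escapes the characters that are special in HTML text and attributes.
    fn escape_html(text: &str) -> String {
        let mut out = String::with_capacity(text.len());
        for c in text.chars() {
            match c {
                '&' => out.push_str("&amp;"),
                '<' => out.push_str("&lt;"),
                '>' => out.push_str("&gt;"),
                '"' => out.push_str("&quot;"),
                _ => out.push(c),
            }
        }
        out
    }

    /// Wraps each region in a span whose class names the syntactic category.
    /// Regions must be sorted, non-overlapping, and lie on char boundaries.
    fn render(source: &str, regions: &[Region]) -> String {
        let mut html = String::new();
        let mut pos = 0;
        for r in regions {
            // Unhighlighted text between regions is escaped and copied as-is.
            html.push_str(&escape_html(&source[pos..r.start]));
            // "type.builtin" becomes the class "hl-type-builtin".
            let class = r.name.replace('.', "-");
            html.push_str(&format!("<span class=\"hl-{}\">", class));
            html.push_str(&escape_html(&source[r.start..r.end]));
            html.push_str("</span>");
            pos = r.end;
        }
        html.push_str(&escape_html(&source[pos..]));
        html
    }

    fn main() {
        let source = "let x = a < b;";
        let regions = [
            Region { start: 0, end: 3, name: "keyword" },
            Region { start: 10, end: 11, name: "operator" },
        ];
        println!("{}", render(source, &regions));
    }
    ```

    Because the output carries only category names, a stylesheet can assign colours per class and switch palettes with a `prefers-color-scheme` media query, with no need to run the highlighter again.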
    
    Microsoft VS Code seems to use numbered class names to indicate the syntactic category of each token, so even a web-based editor isn’t using straightforward CSS for its theme system, instead it compiles proprietary themes to CSS. Disappointing!
    
    ahelwer
    
    Nice, I’ve been meaning to write something like this for Hugo for quite a while. It would probably have to be a markdown preprocessing step since Hugo doesn’t seem to allow plugins by design (it wants to stay fast, is the design rationale I think).
    
    conartist6
    
    That’s pretty nice. I like that the devtools then provides you some pretty-meaningful hints as to what’s going on in the code if you really need help because you don’t know the highlighted language