Gecko: a fast GLR parser with automatic syntax error recovery
15 points by abhin4v
I'm very confused why this write-up, which compares extensively against other tools, fails to include any mention of tree-sitter. That's both a lot more recent than most of the tools mentioned and, from the sound of it, a lot closer to how this one works (also GLR, using the same GLR search technique for error tolerance).
I'm not too confident in this claim, as I haven't played with actual grammars, but I want to share my experience that the following approach to recovery (minimal-cost error recovery)
Gecko’s error recovery guarantees that the parser always produces parse trees corresponding to syntactically correct inputs. The way it achieves this is conceptually simple: some tokens before and/or after the error point are ignored, and the remaining token stream is syntactically valid according to the grammar.
is unsuitable for language servers. For the LSP use-case, you want error resilience rather than recovery. See https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html#Why-Resilience-is-Needed for the spelled-out argument.
Yep, exactly. And you also need a concrete syntax tree, so skipping errors entirely can't work. (I think this is slightly orthogonal to resilience/recovery.)
What do you need a concrete syntax tree for?
A CST corresponds directly to the syntax as written in the input, as opposed to an AST, which is more abstract (hence the A), and (usually) can only be constructed from a valid parse tree. That's a problem in an LSP context where you still want useful functionality even in settings where only part of the file is syntactically valid.
In other words, a CST remains useful even when a source file is only partially valid.
That's not to say you can't make stuff work without one, just that if you are going to build an LSP, you'd be better off having both CST and AST representations.
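To make the CST/AST distinction concrete, here's a toy sketch (hypothetical `Cst`, `FnItem`, and `lower` types/functions of my own invention, not Gecko's or any real LSP's API). The CST keeps every token exactly as written, so a lowering pass can extract whatever valid fragments exist even when the file is incomplete:

```rust
// Toy node shapes, purely to illustrate the CST/AST distinction.
#[derive(Debug)]
enum Cst {
    Token(String),                              // every lexeme as written
    Node { kind: String, children: Vec<Cst> },  // named node with children
}

fn tok(t: &str) -> Cst {
    Cst::Token(t.into())
}
fn node(kind: &str, children: Vec<Cst>) -> Cst {
    Cst::Node { kind: kind.into(), children }
}

// The AST view keeps only the abstract content.
#[derive(Debug, PartialEq)]
struct FnItem {
    name: String,
    param_names: Vec<String>,
}

// Lowering tolerates missing pieces: it extracts whatever valid
// fragments the CST happens to contain.
fn lower(cst: &Cst) -> Option<FnItem> {
    let Cst::Node { kind, children } = cst else { return None };
    if kind != "Fn" {
        return None;
    }
    let mut name = None;
    let mut param_names = Vec::new();
    for child in children {
        match child {
            Cst::Token(t) if t != "fn" && name.is_none() => name = Some(t.clone()),
            Cst::Node { kind, children } if kind == "ParamList" => {
                for p in children {
                    if let Cst::Node { kind, children } = p {
                        if kind == "Param" {
                            if let Some(Cst::Token(n)) = children.first() {
                                param_names.push(n.clone());
                            }
                        }
                    }
                }
            }
            _ => {}
        }
    }
    Some(FnItem { name: name?, param_names })
}

// CST for the incomplete input `fn fib_rec(f1: u32,` -- note the
// '(' , ':' and trailing ',' tokens are all still present.
fn demo_cst() -> Cst {
    node("Fn", vec![
        tok("fn"),
        tok("fib_rec"),
        node("ParamList", vec![
            tok("("),
            node("Param", vec![
                tok("f1"),
                tok(":"),
                node("TypeExpr", vec![tok("u32")]),
            ]),
            tok(","),
        ]),
    ])
}

fn main() {
    // Even though the source is incomplete, the lowering still sees a
    // function named fib_rec with one (partial) parameter.
    println!("{:?}", lower(&demo_cst()));
}
```

The point is that lowering from CST to AST can be best-effort per node, so completion and navigation keep working in the parts of the file that do parse.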
you want error resilience rather than recovery
I read both of the blog posts but I don't understand what you mean by this. You need to recover from errors to be able to do anything at all with the input, so it's not one or the other, you need both. Am I missing anything?
Is the problem with Gecko's approach that it throws away some of the erroneous parts of the code? If so, it should be easy to recover them from the ASTs if you store the first/last tokens in the nodes.
I wish there were a Gecko playground to show concrete examples, but the core issue is this (quoting the resilient parsing blog post):
the parser also recognizes valid partial prefixes of various syntactic constructs.
For a snippet like
fn fib_rec(f1: u32,
The parser should produce a tree like
File
  Fn
    'fn'
    'fib_rec'
    ParamList
      '('
      Param
        'f1'
        ':'
        TypeExpr
          'u32'
      ','
To help the user write the fib_rec function, the LSP must recognize it as a function.
My understanding is that Gecko's parser won't be able to produce those kinds of trees without explicit support everywhere in the grammar.
The "key guarantee" section gives a+a*(a*+a) as an erroneous input example, then says Gecko "finds the minimal-cost repair, not just any repair". But my understanding from reading the whole article is that Gecko finds a minimal-cost repair rather than the minimal-cost repair, i.e. "panic mode". Furthermore, it looks to me like this isn't a general guarantee; for example minimal-cost repair may require inserting a synthetic token rather than removing one, but this algorithm never inserts synthetic tokens.
Regarding parsers that "search for a globally minimal-cost error recovery", my understanding is that lrpar does so, with a claimed >98% real-world success rate.
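For intuition about what "globally minimal-cost" means, here is a brute-force sketch (my own toy, not lrpar's actual algorithm, which is far more efficient): a breadth-first search over single-token deletions and insertions against a tiny expression grammar. Because every edit costs 1, BFS finds a cheapest repair first, and the search considers insertions as well as deletions:

```rust
use std::collections::{HashSet, VecDeque};

// Recognizer for:  E -> T ('+' T)* ;  T -> F ('*' F)* ;  F -> 'a' | '(' E ')'
fn valid(toks: &[char]) -> bool {
    fn expr(t: &[char], mut i: usize) -> Option<usize> {
        i = term(t, i)?;
        while t.get(i) == Some(&'+') {
            i = term(t, i + 1)?;
        }
        Some(i)
    }
    fn term(t: &[char], mut i: usize) -> Option<usize> {
        i = factor(t, i)?;
        while t.get(i) == Some(&'*') {
            i = factor(t, i + 1)?;
        }
        Some(i)
    }
    fn factor(t: &[char], i: usize) -> Option<usize> {
        match t.get(i) {
            Some('a') => Some(i + 1),
            Some('(') => {
                let j = expr(t, i + 1)?;
                if t.get(j) == Some(&')') { Some(j + 1) } else { None }
            }
            _ => None,
        }
    }
    expr(toks, 0) == Some(toks.len())
}

// Globally minimal repair cost, where one deletion or insertion costs 1.
fn min_repair_cost(toks: &[char]) -> usize {
    let alphabet = ['a', '+', '*', '(', ')'];
    let mut seen = HashSet::new();
    let mut q = VecDeque::new();
    seen.insert(toks.to_vec());
    q.push_back((toks.to_vec(), 0usize));
    while let Some((cur, cost)) = q.pop_front() {
        if valid(&cur) {
            return cost; // BFS: first valid string found is cheapest
        }
        // Try every single-token deletion.
        for i in 0..cur.len() {
            let mut next = cur.clone();
            next.remove(i);
            if seen.insert(next.clone()) {
                q.push_back((next, cost + 1));
            }
        }
        // Try every single-token insertion (length capped so the search
        // terminates even on hopeless inputs).
        if cur.len() < toks.len() + 2 {
            for i in 0..=cur.len() {
                for &c in &alphabet {
                    let mut next = cur.clone();
                    next.insert(i, c);
                    if seen.insert(next.clone()) {
                        q.push_back((next, cost + 1));
                    }
                }
            }
        }
    }
    usize::MAX
}

fn main() {
    // The article's example: a+a*(a*+a). Deleting the stray '*' (or
    // inserting an 'a' after it) repairs it, so the minimal cost is 1.
    let toks: Vec<char> = "a+a*(a*+a)".chars().collect();
    println!("minimal repair cost: {}", min_repair_cost(&toks));
}
```

A recovery scheme that only ever deletes tokens happens to do fine on this particular input, but in general the globally minimal repair can require insertions, which is the gap the parent comment is pointing at.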
I read this article wondering how GLR could help with error recovery. My takeaway is that the article conflates ambiguous grammar parsing (e.g. the C "typedef" parsing ambiguity example) with error recovery. Furthermore, it seems to me that the GLR merging operation would struggle with differing error recoveries, so GLR actually gets in the way of globally minimal-cost error recovery.
If this is using the same strategy as tree-sitter and lezer, the way GLR applies to error recovery is that GLR already gives you the primitives for efficient parse stack splitting. An error recovery strategy that proceeds by trying out recovery strategies and picking the most successful one can use that same mechanism to run its search: keep splitting stacks and trying different strategies, then prune off the ones that aren't working and keep going with the ones that look promising.
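That split-and-prune loop can be sketched schematically (this is a made-up beam search of my own, not tree-sitter's, lezer's, or Gecko's actual code; real GLR implementations also share stack tails in a graph-structured stack rather than cloning, which is what makes the splitting cheap):

```rust
// A hypothetical recovery candidate: a (cloned) parse stack plus the
// accumulated cost of the repairs applied so far. `cost` comes first so
// the derived Ord sorts cheapest-first.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Candidate {
    cost: u32,
    stack: Vec<&'static str>, // grammar symbols shifted so far
}

// At an error point, split one candidate into its possible repairs.
fn split(c: &Candidate) -> Vec<Candidate> {
    let mut out = Vec::new();
    // 1. Skip the offending token (cost 1, stack unchanged).
    out.push(Candidate { cost: c.cost + 1, stack: c.stack.clone() });
    // 2. Pop one stack entry, pretending the construct ended early (cost 2).
    if c.stack.len() > 1 {
        let mut s = c.stack.clone();
        s.pop();
        out.push(Candidate { cost: c.cost + 2, stack: s });
    }
    out
}

fn main() {
    // Start from one errored state and run the search, pruning the
    // frontier down to the BEAM cheapest candidates each round.
    const BEAM: usize = 4;
    let start = Candidate { cost: 0, stack: vec!["File", "Fn", "ParamList"] };
    let mut frontier = vec![start];
    for _round in 0..3 {
        let mut next: Vec<Candidate> = frontier.iter().flat_map(split).collect();
        next.sort();          // cheapest candidates first
        next.truncate(BEAM);  // prune the unpromising ones
        frontier = next;
    }
    println!("cheapest surviving cost: {}", frontier[0].cost);
}
```

The real versions score candidates by how much further input they manage to parse, not just by repair cost, but the shape of the search is the same: split, score, prune, continue.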