Giving LLMs a Formal Reasoning Engine for Code Analysis
16 points by Yogthos
Have you actually run evals on the efficiency claim? Fwiw, the kind of program you've written matches exactly what I've envisioned before, but I've mellowed down on these approaches because:
I'm not saying you're wrong, I'd love it if you're right, but I'm not certain if that's the case.
I've been using it locally with Claude Code. I've benchmarked this by having it accomplish tasks on a codebase both the traditional way and using the tool, and then comparing token use and the number of iterations it takes to solve the problem. A big part of the initial motivation for me was watching Claude constantly grep through the code, which seems less than ideal for actually finding things in a codebase.
I do find the quality of the model plays a role here. Claude is fairly decent at following instructions, so I've set an explicit directive for it in the global context telling it to always default to this MCP when analyzing code.
What makes the skill work is having templates the model can fill. I started out by trying to have it write Prolog directly, and as you point out, it's not very good at that; it took a few attempts to even produce working queries. So I moved to a set of general templates for common tasks: the model queries the MCP for skills, selects one, fills in the blanks, and does some light massaging. It can do that reliably on the first shot.

Whenever it uses a template, the template gets stored in sqlite, and if a new template keeps getting used, it gets promoted to the permanent set of skills. This way the tool adapts over time to the specific types of tasks you're working on, and has readily available templates for the common cases the model encounters.
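To make the mechanism concrete, here's a minimal sketch of that template flow: the model picks a template, fills the blanks, and templates that keep getting used get promoted. The names (`fillTemplate`, the "callers"/"unused" templates, the promotion threshold) are illustrative, not the actual implementation, and a Map stands in for the sqlite store.

```javascript
// A template is a Prolog query with named {holes} the model fills in.
const templates = new Map([
  ["callers", { query: "caller(Caller, {fn}).", uses: 0, permanent: false }],
  ["unused",  { query: "defined({fn}), \\+ caller(_, {fn}).", uses: 0, permanent: false }],
]);

const PROMOTE_THRESHOLD = 5; // after this many uses, persist as a permanent skill

function fillTemplate(name, bindings) {
  const t = templates.get(name);
  if (!t) throw new Error(`unknown template: ${name}`);
  t.uses += 1;
  if (t.uses >= PROMOTE_THRESHOLD) t.permanent = true; // stand-in for the sqlite write
  // Substitute each {key} hole with the model-supplied value.
  return t.query.replace(/\{(\w+)\}/g, (_, key) => bindings[key]);
}

console.log(fillTemplate("callers", { fn: "parseConfig" }));
// caller(Caller, parseConfig).
```

The point of the fill-in-the-blanks step is that the model only supplies atoms, not whole Prolog programs, which is what makes the first-shot reliability plausible.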
I haven't explored the reachability issues with stuff like mocks too closely yet, so that could be a scenario where grepping might work better.
And as far as latency goes, I find this has been noticeably faster, at least for my projects. The LLM just has to make a single query here compared to multiple greps with roundtrips to the server. The data is local and in memory, so searching through it happens quickly and reliably. With grep, not only do you need more cycles typically, but the whole process takes longer since the model might not find what it's looking for on the first shot.
Another use case here is that you can ask the model to use the tool directly to answer questions about code. If you want to know definitively whether a piece of code is being used, or what all its callers are, this gives you a provable answer you can actually trust.
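The "who calls this?" question above boils down to a transitive closure over call-graph facts, which is why the answer is exhaustive rather than a grep guess. A rough sketch, with made-up function names and a plain array standing in for the fact store:

```javascript
// caller(X, Y) facts: calls[i] = [caller, callee]
const calls = [
  ["main", "loadConfig"],
  ["main", "serve"],
  ["serve", "handleRequest"],
  ["handleRequest", "loadConfig"],
];

// All direct and transitive callers of `fn` — the closure grep can't guarantee.
function callersOf(fn, seen = new Set()) {
  for (const [caller, callee] of calls) {
    if (callee === fn && !seen.has(caller)) {
      seen.add(caller);
      callersOf(caller, seen); // recurse: callers of callers
    }
  }
  return seen;
}

console.log([...callersOf("loadConfig")].sort());
// [ 'handleRequest', 'main', 'serve' ]
```

An empty result here is a proof of "no callers" (relative to the indexed facts), which is exactly what text search can't give you.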
I think there's a lot more that could be done here going forward, but I find this to be promising at least for my personal use already.
I've benchmarked this (...)
I don't see an obvious mention of the actual benchmark results in that reply 😅 The only part that sounded possibly relevant is near the end, where you mention:
And as far as latency goes, I find this has been noticeably faster, at least for my projects.
Is this ("noticeably faster ... for my projects") the main finding of the benchmark?
I haven't formally benchmarked it if that's what you're asking. I mainly made it to play around with the ideas from the paper. It seems useful to me dogfooding it so far, that's all I can tell you. Doing full on benchmarking would be interesting, but I just haven't had the time or energy to do it. If you want to give it a shot, I'd be interested to see how it works out. :)
Did you maybe consider making it work as a CLI tool as well? (Or does it already have such an interface? I scanned the article and the readme but didn't read it in full detail so may have missed if yes.)
There are two aspects to it that are on my mind:
I haven't thought about adding a CLI, but it should certainly be easy enough to do. I'll add an issue so I don't forget about it. Using it directly might be an interesting use case in itself.
when discovering tools being sometimes created for LLMs, I keep getting this feeling that I wish I myself as a developer had access to such cool tools...
In the very specific case of this article, semgrep/opengrep/codeql/treedb feel like they could be a good equivalent. I wonder if LLMs would have an easier time dealing with these though (on the one hand these tools are dedicated to program analysis and could prevent the LLM from getting bogged down with things unrelated to program analysis, on the other hand there likely is much more regular prolog in the LLM's training data, which would be a huge advantage).
That tends to be the tricky part with LLMs since it's really hard to tell what kind of tool they're likely to use effectively aside from just trial and error. When I started with the project, I was just thinking of making a logic engine available to the LLM in general based on what the paper described. And after I made the first iteration of it, I had the realization that code analysis might be a good fit. So that's the main reason it's Prolog based. A nice part with this approach is that you're not restricted to just code analysis. If you have any problem that can be expressed using a formal grammar, now you can just ask the LLM explicitly to use the tool and do that.
I’ve tried this in a more limited way with a skill giving LLMs access to SWI-Prolog. If prompted they will use it, but (presumably because using it is not a trained behavior) it’s very difficult to get them to do so consistently. Honestly, even other tools intended for searching code are rarely used reliably.
The big problem with LLMs is that they essentially forget stuff, so either the behavior has to be trained in (using reinforcement learning or otherwise) or the agent harness has to be designed to remind them about a tool for it to get used regularly.
I strongly suspect this is why using a harness with the model tuned for it seems to give better results than mixing and matching.
That's the biggest issue I'm noticing as well. You can give instructions to the LLM and remind it to use tools, but ultimately they won't do it reliably unless you explicitly ask them to. I completely agree that tuning a model to a harness is likely the best approach here.
Another direction to go in is to take control away from the LLM entirely with the harness deciding what tools to use. You might be able to use a combination of a big model whose job it is to reason about the code, and what to do, and then a small model trained on a harness which makes decisions about the tool use. So, the model doing the work would have to ask the harness for what it wants, and the harness would run the appropriate tool and return the result.
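A rough sketch of that "harness decides" split: the reasoning model emits a structured intent, and the harness, not the model, maps it to a tool invocation. The tool names and the request shape here are hypothetical, and strings stand in for real tool calls.

```javascript
// Tools the harness owns; the model never calls these directly.
const tools = {
  callers:    (args) => `prolog: caller(Caller, ${args.fn}).`,
  definition: (args) => `index lookup: defined(${args.fn}, File, Line).`,
};

// The harness owns tool selection; the big model only states what it wants.
function harness(request) {
  const tool = tools[request.intent];
  if (!tool) return "harness: no tool for that intent, answer from context";
  return tool(request.args);
}

console.log(harness({ intent: "callers", args: { fn: "parseConfig" } }));
// prolog: caller(Caller, parseConfig).
```

In the two-model version described above, the small tuned model would sit where the `tools[request.intent]` lookup is, choosing the tool instead of a hard-coded table.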
this is so damn cool! I had Claude implement this idea straight up with the native golang toolchain, skipping tree sitter and prolog. Still evaluating but it cuts down time to research code considerably!
Oh interesting, Go would definitely be snappier. I mainly went with JS for portability and to have the REPL workspace to accumulate results, which maybe isn't that important for the code analysis case.
This already exists in several programs. One that I use, which does AST parsing and all these other things, is called Kreuzberg; it has native trees built in.
The goal here is to make a general purpose MCP which gives LLMs access to a solver and a Prolog engine. Using this for reasoning about code graphs is just one obvious use case, but it's usable for anything that can be expressed using formal logic. The tool is also template based, and as the LLM solves problems using templates, new ones can be organically created and saved in a db. So the set of skills the LLM has access to grows over time.