Giving LLMs a Formal Reasoning Engine for Code Analysis
16 points by Yogthos
Have you actually run evals on the efficiency claim? Fwiw, the kind of program you've written matches exactly what I've envisioned before, but I've mellowed down on these approaches because:
I'm not saying you're wrong, I'd love it if you're right, but I'm not certain if that's the case.
I've been using it locally with Claude Code. I've benchmarked this by having it accomplish tasks on a codebase both the traditional way and using the tool, and then comparing token use and the number of iterations it takes to solve the problem. A big part of the initial motivation for me was watching Claude constantly grep through the code, which seems less than ideal for actually finding things in a codebase.
I do find the quality of the model plays a role here. Claude is fairly decent at following instructions, so I've set an explicit directive for it in the global context telling it to always default to this MCP when analyzing code.
What makes the skill work is having templates the model can fill. I started out by trying to have it write Prolog directly, and as you point out, it's not very good at that; it took a few attempts to even produce working queries. So I moved to a set of general templates for common tasks: the model queries the MCP for skills, selects one, fills in the blanks, and does some light massaging. It can do that reliably on the first shot.

Whenever it uses a template, the template gets stored in sqlite, and if a new template keeps getting used, it gets promoted to the permanent set of skills. This way the tool adapts over time to the specific types of tasks you're working on, and has readily available templates for the common cases the model encounters.
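To make the mechanism concrete, here's a minimal sketch of that template flow: the model picks a template, fills the blanks, and templates that keep getting used get promoted. The names (`fillTemplate`, the "callers"/"unused" templates, the promotion threshold) are illustrative, not the actual implementation, and a Map stands in for the sqlite store.

```javascript
// A template is a Prolog query with named {holes} the model fills in.
const templates = new Map([
  ["callers", { query: "caller(Caller, {fn}).", uses: 0, permanent: false }],
  ["unused",  { query: "defined({fn}), \\+ caller(_, {fn}).", uses: 0, permanent: false }],
]);

const PROMOTE_THRESHOLD = 5; // after this many uses, persist as a permanent skill

function fillTemplate(name, bindings) {
  const t = templates.get(name);
  if (!t) throw new Error(`unknown template: ${name}`);
  t.uses += 1;
  if (t.uses >= PROMOTE_THRESHOLD) t.permanent = true; // stand-in for the sqlite write
  // Substitute each {key} hole with the model-supplied value.
  return t.query.replace(/\{(\w+)\}/g, (_, key) => bindings[key]);
}

console.log(fillTemplate("callers", { fn: "parseConfig" }));
// caller(Caller, parseConfig).
```

The point of the fill-in-the-blanks step is that the model only supplies atoms, not whole Prolog programs, which is what makes the first-shot reliability plausible.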
I haven't explored the reachability issues with stuff like mocks too closely yet, so that could be a scenario where grepping might work better.
And as far as latency goes, I find this has been noticeably faster, at least for my projects. The LLM just has to make a single query here compared to multiple greps with roundtrips to the server. The data is local and in memory, so searching through it happens quickly and reliably. With grep, not only do you need more cycles typically, but the whole process takes longer since the model might not find what it's looking for on the first shot.
Another use case here is that you can ask the model to use the tool directly to answer questions about code. If you want to know definitively whether a piece of code is being used, or what all its callers are, this gives you a provable answer you can actually trust.
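The "who calls this?" question above boils down to a transitive closure over call-graph facts, which is why the answer is exhaustive rather than a grep guess. A rough sketch, with made-up function names and a plain array standing in for the fact store:

```javascript
// caller(X, Y) facts: calls[i] = [caller, callee]
const calls = [
  ["main", "loadConfig"],
  ["main", "serve"],
  ["serve", "handleRequest"],
  ["handleRequest", "loadConfig"],
];

// All direct and transitive callers of `fn` — the closure grep can't guarantee.
function callersOf(fn, seen = new Set()) {
  for (const [caller, callee] of calls) {
    if (callee === fn && !seen.has(caller)) {
      seen.add(caller);
      callersOf(caller, seen); // recurse: callers of callers
    }
  }
  return seen;
}

console.log([...callersOf("loadConfig")].sort());
// [ 'handleRequest', 'main', 'serve' ]
```

An empty result here is a proof of "no callers" (relative to the indexed facts), which is exactly what text search can't give you.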
I think there's a lot more that could be done here going forward, but I find this to be promising at least for my personal use already.
I've benchmarked this (...)
I don't see an obvious mention of the actual benchmark results in that reply 😅 The only part that sounded possibly relevant is near the end, where you mention:
And as far as latency goes, I find this has been noticeably faster, at least for my projects.
Is this ("noticeably faster ... for my projects") the main finding of the benchmark?
I haven't formally benchmarked it if that's what you're asking. I mainly made it to play around with the ideas from the paper. It seems useful to me dogfooding it so far, that's all I can tell you. Doing full on benchmarking would be interesting, but I just haven't had the time or energy to do it. If you want to give it a shot, I'd be interested to see how it works out. :)
Did you maybe consider making it work as a CLI tool as well? (Or does it already have such an interface? I scanned the article and the readme but didn't read it in full detail so may have missed if yes.)
There are two aspects to it that are on my mind:
I haven't thought about adding a CLI, but it should certainly be easy enough to do. I'll add an issue so I don't forget about it. Using it directly might be an interesting use case in itself.
when discovering tools being sometimes created for LLMs, I keep getting this feeling that I wish I myself as a developer had access to such cool tools...
In the very specific case of this article, semgrep/opengrep/codeql/treedb feel like they could be a good equivalent. I wonder if LLMs would have an easier time dealing with these though (on the one hand these tools are dedicated to program analysis and could prevent the LLM from getting bogged down with things unrelated to program analysis, on the other hand there likely is much more regular prolog in the LLM's training data, which would be a huge advantage).
That tends to be the tricky part with LLMs since it's really hard to tell what kind of tool they're likely to use effectively aside from just trial and error. When I started with the project, I was just thinking of making a logic engine available to the LLM in general based on what the paper described. And after I made the first iteration of it, I had the realization that code analysis might be a good fit. So that's the main reason it's Prolog based. A nice part with this approach is that you're not restricted to just code analysis. If you have any problem that can be expressed using a formal grammar, now you can just ask the LLM explicitly to use the tool and do that.
I’ve tried this in a more limited way with a skill giving LLMs access to SWI-Prolog. If prompted they will use it, but (presumably because using it is not a trained behavior) it’s very difficult to get them to do so consistently. Honestly, even other tools intended for searching code are rarely used reliably.
The big problem with LLMs is that they essentially forget stuff, so either the behavior has to be trained in (using reinforcement learning or otherwise) or the agent harness has to be designed to remind them about a tool for it to get used regularly.
I strongly suspect this is why using a harness with the model tuned for it seems to give better results than mixing and matching.
That's the biggest issue I'm noticing as well. You can give instructions to the LLM and remind it to use tools, but ultimately they won't do it reliably unless you explicitly ask them to. I completely agree that tuning a model to a harness is likely the best approach here.
Another direction to go in is to take control away from the LLM entirely with the harness deciding what tools to use. You might be able to use a combination of a big model whose job it is to reason about the code, and what to do, and then a small model trained on a harness which makes decisions about the tool use. So, the model doing the work would have to ask the harness for what it wants, and the harness would run the appropriate tool and return the result.
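A rough sketch of that "harness decides" split: the reasoning model emits a structured intent, and the harness, not the model, maps it to a tool invocation. The tool names and the request shape here are hypothetical, and strings stand in for real tool calls.

```javascript
// Tools the harness owns; the model never calls these directly.
const tools = {
  callers:    (args) => `prolog: caller(Caller, ${args.fn}).`,
  definition: (args) => `index lookup: defined(${args.fn}, File, Line).`,
};

// The harness owns tool selection; the big model only states what it wants.
function harness(request) {
  const tool = tools[request.intent];
  if (!tool) return "harness: no tool for that intent, answer from context";
  return tool(request.args);
}

console.log(harness({ intent: "callers", args: { fn: "parseConfig" } }));
// prolog: caller(Caller, parseConfig).
```

In the two-model version described above, the small tuned model would sit where the `tools[request.intent]` lookup is, choosing the tool instead of a hard-coded table.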
this is so damn cool! I had Claude implement this idea straight up with the native golang toolchain, skipping tree sitter and prolog. Still evaluating but it cuts down time to research code considerably!
Oh interesting, Go would definitely be snappier. I mainly went with JS for portability and to have the REPL workspace to accumulate results, which maybe isn't that important for the code analysis case.
This already exists in several programs. One that I use, which does AST parsing and all these other things, is called Kreuzberg; it has native trees built in.
The goal here is to make a general purpose MCP which gives LLMs access to a solver and a Prolog engine. Using this for reasoning about code graphs is just one obvious use case, but it's usable for anything that can be expressed using formal logic. The tool is also template based, and as the LLM solves problems using templates, new ones can be organically created and saved in a db. So the set of skills the LLM has access to grows over time.