Grounding LLMs with Recursive Code Execution
10 points by Yogthos
One problem is that LLMs tend to be messy coders: they drop semicolons, import modules that don't exist, and so on. The way around that is a self-healing layer: if the sandbox throws a syntax error, a lightweight intermediate step attempts to fix the imports and syntax before re-running.
Competent harnesses allow users to directly control which tokens are emitted. There is no reason a model ever has to emit invalid syntax.
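That's constrained decoding: harnesses like llama.cpp's grammar support do this by masking disallowed tokens before each sampling step. A minimal sketch of the mechanism (all names hypothetical):

```javascript
// Before sampling, force the probability of any token the grammar
// disallows to zero by setting its logit to -Infinity.
function maskLogits(logits, allowedIds) {
  return logits.map((logit, id) => (allowedIds.has(id) ? logit : -Infinity));
}

// Greedy sampling over the (possibly masked) logits.
function sampleGreedy(logits) {
  return logits.indexOf(Math.max(...logits));
}

const logits = [3.0, 5.0, 1.0];  // token 1 is the model's favorite...
const allowed = new Set([0, 2]); // ...but the grammar forbids it here
console.log(sampleGreedy(maskLogits(logits, allowed))); // 0
```

With the grammar driving the allowed set at every step, syntactically invalid output is impossible by construction rather than repaired after the fact.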
Yeah, that's basically what I'm doing. I just fix the obvious problems in the interaction layer using a linter.
What I've found interesting is that the approach works well even with tiny models. For example, I've been using qwen2.5-coder:7b in my testing, and it converges on a solution very quickly. This is a task a model this size simply couldn't accomplish on its own if it tried to analyze the document directly. A run looks something like this:
npx tsx src/index.ts \
"use javascript to write code to find the total sales amount across all regions?" \
./test-fixtures/scattered-data.txt \
--verbose
Configuration:
Provider: ollama
Model: qwen2.5-coder:7b
Max turns: 10
Timeout: 30000ms
[RLM] Loaded document: 4,706 characters
[RLM] Sandbox created (maxSubCalls: 10, timeout: 30000ms)
──────────────────────────────────────────────────
[Turn 1/10] Querying LLM...
[Turn 1] Executing code:
// Step 1: Search for data related to sales
const hits = grep("sales"); // Returns array of {match, line, lineNum}
console.log(JSON.stringify(hits, null, 2));
// Step 2: Process results - extract sales amounts from each matching line
let totalSales = 0;
for (const hit of hits) {
  const saleAmount = parseFloat(hit.line.match(/\d+(\.\d{1,2})?/)[0]);
  if (!isNaN(saleAmount)) {
    totalSales += saleAmount;
  }
}
// Step 3: Store the result in memory
memory.push({ key: "total_sales", value: totalSales });
// Final output
console.log("Total sales across all regions:", totalSales);
[Turn 1] Console output:
[
  ...
  {
    "match": "Sales",
    "line": "- Sales Data: Revenue in USD before taxes",
    "lineNum": 122,
    "index": 4357,
    "groups": []
  }
]
[Turn 1] Error: TypeError: Cannot read properties of null (reading '0')
──────────────────────────────────────────────────
[Turn 2/10] Querying LLM...
[Turn 2] Executing code:
// Extract sales data from the logs
const salesData = [
  { region: "NORTH", amount: 2340000 },
  { region: "SOUTH", amount: 3120000 },
  { region: "EAST", amount: 2890000 },
  { region: "WEST", amount: 2670000 },
  { region: "CENTRAL", amount: 1980000 }
];
// Calculate total sales
const totalSales = salesData.reduce((acc, curr) => acc + curr.amount, 0);
// Output the total sales
console.log("Total Sales:", totalSales);
[Turn 2] Console output:
Total Sales: 13000000
──────────────────────────────────────────────────
[Turn 3/10] Querying LLM...
[Turn 3] Executing code:
// Output the total sales
console.log("Total Sales:", totalSales);
[Turn 3] Console output:
Total Sales: 13000000
[Turn 3] Final answer found after code execution
[RLM] Sandbox disposed
The total sales are 13000000.
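The Turn 1 error, by the way, is a classic failure mode: String.match returns null when a line contains no number, so indexing [0] throws. The guarded version looks like this (sample hits made up to mirror the grep output in the log):

```javascript
// Hypothetical hits mirroring the {match, line, lineNum} shape from grep()
const hits = [
  { line: "NORTH region total: 2340000" },              // has a number
  { line: "- Sales Data: Revenue in USD before taxes" } // no number -> null
];

let totalSales = 0;
for (const hit of hits) {
  const m = hit.line.match(/\d+(\.\d{1,2})?/); // null when nothing matches
  if (m) {
    const amount = parseFloat(m[0]);
    if (!isNaN(amount)) totalSales += amount;
  }
}
console.log(totalSales); // 2340000
```

Interestingly, the loop recovered anyway: the error plus the logged grep output in Turn 1 gave the model enough context to restructure its approach in Turn 2.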
One thing I found is that you do need to tune prompts for each specific model: they respond to different kinds of instructions and hints, so you have to tweak the prompting to get reliable behavior. Some models are also just not good at figuring out that they need to parse currencies or dates, and there's not much you can do there: if the model doesn't know it needs to write that code, it's not going to try.
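To make that concrete, here's a sketch of the kind of helper a model has to *decide* to write on its own (function name and formats are illustrative, not from my harness):

```javascript
// Parse currency strings like "$1.5M" or "$1,980,000" into numbers.
// A model that treats these as opaque strings will silently get 0s or NaNs.
function parseCurrency(s) {
  const m = s.match(/\$?([\d,]+(?:\.\d+)?)\s*([KMB])?/i);
  if (!m) return NaN;
  const n = parseFloat(m[1].replace(/,/g, "")); // drop thousands separators
  const mult = { K: 1e3, M: 1e6, B: 1e9 }[(m[2] || "").toUpperCase()];
  return mult ? n * mult : n;
}

console.log(parseCurrency("$1.5M"));      // 1500000
console.log(parseCurrency("$1,980,000")); // 1980000
```

If the model never realizes the document mixes "$1.5M" with "$1,980,000", no amount of retrying in the loop will save it.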