Artificial adventures
45 points by jamii
45 points by jamii
On the topic of pi being less buggier than other harnesses, it's because it's a smaller team working on it + the maintainers trying to maintain some kind of quality bar + reviewing code & thinking about what features should go in vs not, instead of just chucking the whole kitchen sink into the harness.
https://mariozechner.at/posts/2026-03-25-thoughts-on-slowing-the-fuck-down/
I'm pretty sure that even with all those three advantages if I tried to vibecode something substantial it would turn into a tangled mess. So there's definitely some additional skill going on.
If a bot writes the code for me I still need to do the work of building mental models and I'm no longer getting it for 'free' from writing code. I'd need a separate practice, something like review++, to keep on top of it. Just reading code doesn't work that well, in the same way that reviewing your highlighted notes is not actually prepping you for an exam.
This is a very good point (cf. https://lobste.rs/s/ac0akx/programming_as_theory_building_1985 ).
It also ties in with something I can't remember where I read about how you have easy wins when you first try to use an LLM in a project where your brain has a good theory of the system, and then if you let it loose for a while you start to get disconnected and turn into one of those non-coding project managers who can't specify things well and so the frustrations increase.
Typing prompts is annoying in an interface where basic text editing doesn't work (eg clicking to move the cursor)
In pi, press ctrl+G to open your prompt in $EDITOR. In theory you should be able to find one, even TUI, that supports click to move and matches your needs.
Otherwise, good blog post that I think I generally agree with.
I will shamelessly steal adopt "a fever dream with unit tests" in my own speech and writing from now on.
There are several well-known multi-agent orchestrators where expensive models are basically hallucinating 60% of the orchestrator into existence at runtime.
This is really similar to my own experience. I'd add that I've also had a lot of success using claude code to debug linux desktop issues; after 25 years my dotfiles have layers of cruft that is tedious to debug through. Conveniently, I've used yadm to share dotfiles between machines without secrets, so sandboxing is trivial.
Having the LLM review code changes sounds like a practice worth adopting. On top of the value jamii describes, there's a red queen's race, someone is going to be running an LLM to check on your commit anyways, whether that's an open source repo or against prod. I've gotten 4 valid vulnerability reports in Lobsters in the last 2 weeks from people using LLM-based scanners (all fixed). I can only recall 2 others in the prior 9 years.
It a bit strange to claim that "I haven't seen anything I would call a hallucination from the frontier models" when the post lists a number of things I would consider hallucinations.
I think it’s plausible to see a distinction between:
And in that regard I think the post didn’t mention any hallucinations that I caught. But lies, laziness and bad judgment aren’t exactly great either.
I think it’s helpful to make this distinction, because I think hallucinations may be easier to reduce than lies, laziness and bad judgment.
Although hallucinations and lies both lead the model to say something false, hallucinations are more a product of them just being dumb/poorly informed. They can be combated by asking for sources, training the models to not answer based on weak information etc. Lies and laziness are a product of their goal-directed behavior and reinforcement learning, and seem much harder to train away.
The problem with these distinctions is they presuppose a questionable anthropomorphism.
They don’t.
You could read them in a maximalist way, and think that the model is literally lying in the same sense as a human. That would be a mistake.
But you could also take them to be metaphors that apply to a greater or lesser extent. There’s some risk of anthropomorphism, but it’s avoidable.
So what your comment says is that it is unavoidable, any attempt to draw distinctions here is going to be a failure. You really haven’t justified that though.
It's very difficult to interpret what "maximalist reading" could mean since it implies some extremely vague "minimalist reading" - but I'll leave that aside for now.
Re: justification - what is there for me to justify? They are all simply categorical falsehoods. You are claiming such distinctions can be made, but I don't see how either common sense nor some more evidence based approach could lead to such distinctions.
Common sense dictates that if any human said these things to me, they are malignant or hallucinating on drugs unable to communicate coherently about reality.
I am skeptical about an evidence based approach to these distinctions here, but perhaps there are some things in the LLM literature I'm not aware of - I admit haven't followed it that closely.
They are all simply categorical falsehoods. I am skeptical about an evidence based approach to these distinctions here, but perhaps there are some things in the LLM literature I'm not aware of - I admit haven't followed it that closely.
I think that first sentence is a little aggressive, given the second.
I have spent some significant time reading papers on LLMs and understanding training. But the claim I'm making isn't super precise, and it's not like I'm citing a specific paper that says what I'm saying. These are my own inferences.
What I'm thinking is that as you pre-train an LLM, you're effectively do a form of compression, where the LLM represents patterns in the training data. This compression is lossy. As a result of this compression, the LLM will sometimes "believe" something to be true that is not (https://vgel.me/posts/seahorse/). By "belief" I mean a very minimalist claim--it will reliably say that thing. The usual accompaniments of human belief need not be present.
These hallucinations are a very low level property--the model just has a very fuzzy representation of a particular thing.
After that pre-training is done, the model is subjected to reinforcement learning that's designed to teach it to be a helpful agent, solve tasks, things like that. As a result of that reinforcement learning, it gets better at goal directed behavior, but it also engages in reward-hacking. That reward-hacking leads to behaviors like trying to claim success, taking speculative short-cuts when its context window is getting too full, etc. (Note that none of those things have to be anthropomorphized, you do not have to think that the model is conscious, or even "thinking", you can just talk about how its output changes).
So I think actually different aspects of training are relevant to hallucinations as opposed to laziness and shortcuts.
That's not the only difference--I think it's just much harder to make a model that is well calibrated about how much effort to put in, and when to make a shortcut than it is to make a model that says "I don't know" some of the time.
I think a better distinction may be "false answer from the knowledge in the weights" (you ask it who founded Lobsters and it says Paul Graham without doing any internet research) vs "false answer from information in the context" (you ask it who founded Lobsters, it reads https://lobste.rs/about, and then says "pushcx, better known by his real name Paul Graham").
I should probably write up my own experience for a one-off side project. I don't see hallucinations really, more misconceptions. First example is attempting to GET on /foo/bar to get all bar entries when a POST creates a new bar but no GET exists. The second is that for some other API /progress the result actually contains all bar's for some reason (which is not far off - /me contains all your bar's), until specifically telling claude that neither works and to have a hard look at the 3rd party codebase again that is the APIs origin.
I think using LLMs for code review (rather than writing new code), especially on a solo project, has the highest ratio of benefit to risk. If you don't have another experienced, dedicated person to thoroughly review your code, getting it analysed by an LLM is quite literally better than nothing.
I don't want to use commercial cloud-based models for my open source work, but I've been experimenting with local LLMs for code review (telling them only to briefly describe issues, but not to generate any new code). Local models are almost certainly not as good as the commercial ones (I can only guess), but Qwen 3.6 27B in particular has been pretty useful. I ran it on a medium-sized Rust codebase, and it was about 70% good. As in, about 60% of the problems it found were pretty spot-on, another 20% or so of the feedback was not of good quality, but pointed me towards problematic sections of code and got me to look at it and improve it, and 20% garbage (numbers based on vibes, I didn't actually measure anything). The significant amount of garbage in the results is not good, but at least it made it immediately obvious that I should be on my toes with what it says. I also don't know how many real issues it completely missed, and some of the ones it found were relatively superficial (like typos in doc comments). But overall I felt it was a net benefit, because it got me to improve my code. The one risk is that I might start relying on the LLM instead of carefully re-reviewing my own work, and that would be a temptation, but this Qwen model is slow enough to run on my machine that I don't want to do it after every code change.
Other models (like Qwen 3.6 35B and especially Gemma4 26B) were much faster but significantly worse. But Qwen 27B (slow as it is to run today, and that's if you have the hardware to do so at all) shows we might have a decent future of using local models to help us improve our code, without depending on commercial providers whose incentives are mixed at best, and without taking our own expertise and joy out of hacking on code. I still feel very, very mixed about including an LLM in the process at all, but this at least feels better than the alienating vision of the future that's being pushed by the big providers.
I agree that, of the harnesses I tried, pi is the only one that feels sober.
The one risk is that I might start relying on the LLM instead of carefully re-reviewing my own work, and that would be a temptation, but this Qwen model is slow enough...
I've been firing up the bot, then doing my own review, then coming back to look at the bots output to see what I missed. They often pick up different problems than a human would, so it's pretty complementary to human review.
Regarding the "pointing to text", that's already solved in gui IDEs/editors. If you use a jetbrains ide, the plugin can always pass the file and line you have selected to the prompt as context.
It will also show you the diffs inline or with a diff window, depending on how you are requesting it
Zed similarly has a feature called Inline Assistant that works like the author describes. You select a region, press control-enter, and enter a prompt. The LLM transforms the selection according to the prompt and replaces it.
The UX is very nice, but you have all of the normal issues with the LLM output.
Wow, what a breath of sanity. Let's compare notes.
I am purchasing tokens on Novita (US-based, for work), DeepSeek and recently Xiaomi (CN-based, for personal projects). I tried Kimi directly but it did not convince me to continue using it. I do not have experience using Claude Code nor Codex nor random harness of the day. I have used Qwen Code which is a fork of Google something to bootstrap a personal harness in Rust + ratatui. It uses single-threaded async, which was a chore to convince models to do since they really love their threads and mpsc. I think smol is nice, by the way.
The net result is that I somewhat understand what the tool does and how it does it do it. Every time I catch model to invent new tool syntax I weight pros and cons and sometimes add a local heal for that particular case. In my experience these are mostly synonyms for tool argument names. The smaller the number of activated parameters, the more the models attend to what and forget how, which is understandable. I think that at some point we will extract the tool calls from latent space instead of forcing the tokens for much better results. Maybe using some dedicated model to translate. But I digress.
I am using landlock to isolate the model to the project directory and cut it off from my home. It is allowed to read system paths outside of my home and write to /tmp, some package cache directories in my home and e.g. /dev/null. I may add better isolation in the future, but this seems like a basic hygiene when most people I know just run Claude Code raw, which sounds insane to me. I do not block network. I am not working on anything that warrants extra exfiltration protections.
It's not always a clear hit. It helps to first define some guidelines, because models are decent at comparing code against guidelines and flagging deviations. But general "tell me if this sucks" tends to give random results.
I have yet to experience bluffing from e.g. DeepSeek V4 Flash which I use as a baseline. DeepSeek V4 Pro is worse at coding for me. Xiaomi MiMo 2.5 Pro is better but slightly more expensive. Plain MiMo 2.5 is worse. In my experience the models are mostly just being stupid. Especially when their context gets polluted by conflicting ideas and/or gets overly long.
Occasionally the models get this idea that they should "cut corners to deliver value earlier" (paraphrasing) and sometimes this means I have to undo couple steps and have model point out the contradiction in my instructions. Sometimes it was me lacking sufficient insight, sometimes it is the model overengineering. Sometimes we compromise by me simplifying the requirements to get out of some nasty corner case. The usual.
refactoringI hate to use models to refactor. They are incapable of making good decisions. Split a function into two for different use-cases, ask model to go through the code base and make a judgement call for each call site what variant to use. They get 25% sites wrong but never show any signs of uncertainty. It is just way better to have them investigate the code base, map out impacts and then do the refactor yourself.
For dead simple restructurings such as wrapping some operation on multiple sites they do speed up the job. But you still have to prompt them to explicitly check for e.g. stale comments or newly out-of-place variable names.
writing code togetherUnlike the OP I did not experience models going out of their way to do stuff outside what I have asked them to do. Maybe this is due to the fact that I demand a clear upfront plan before I allow any edits. I also monitor reasoning traces live and re-prompt when the model starts being silly.
I can't review bot code effectively either. I keep merging changes, and then much later revisiting the same code and finding fresh horrors that I didn't notice the first time.
Yes. For work code I simply do not commit until I have read and understood everything. I tend to make major modifications during this phase and then ask the model to re-check after me. It tends to find e.g. typos, swapped variables or other minor issues that would nevertheless be a problem.
For personal projects the first version is simply a throwaway. Once the actual architecture is clear a full rewrite is required, this time with proper upfront planning. I think this might be slightly underrated.
Model I am using are quite fast. Unless I specifically ask for a long investigation I do not have to task-switch. If I do have a bit of time because the model is taking long to convince themselves of the number of rs in strawberry I usually think ahead.
What seems to work, to a degree, is to use the model to formulate a plan, explicitly write it to a file and then iterate on it for a bit. Models can search the code base and help one understand the implications before one starts to code. It is also easier to keep model on tracks when there is a tangible upfront plan.
search and other cheap laborI have used a model to discover papers around a certain topic. That was not at all bad. Then grabbed them via my subscriptions and/or alternate sources myself. Had model read and determine if they are relevant to the topic. That was very effective actually. I have then read the relevant papers myself.
I have had models investigate a large code base and describe certain aspects. That was also somewhat productive.
The common theme in both these cases was that the model hallucinated quite a lot. The rate of hallucination was primarily affected by how deep in the context were the key facts burried. Having model classify and summarize a single paper and then wiping the context and doing another one has significantly reduced the issue. This has likely something to do with how sparse attention works, but I am no expert.
brainstorming and creativityUseless.
thoughtsDeepseek v4 flash is astonishingly cheap but not quite smart enough yet to be useful, and the most misaligned of the models I tried eg the most likely to lie about having run tests successfully.
This has never happened to me. It sometimes gets very difficult to steer, but I have yet to experience it lying to me about running tests or type checker. If anything, it tends to re-run both after touching up comments. "Just to make sure."
probably the most interesting thing that has happened in my lifetime
Nah, I don't think so.
You only mention $20/month subscriptions for Anthropic and OpenAI models, but say Pi is far better than Claude Code or Codex. Does this mean you never actually tried the good combination (frontier model + pi)? I thought the subscriptions force you to use their harness.
The Codex subscription can definitely be used with Pi.
Yeah, I used gpt with pi sometimes, mostly so I could see it's chain of thought. Also opus a little bit when anthropic had some promotional credits.
Reviewing the refactor can be hard though, because the bots like to mix in 200 correct callsite changes with one random unrelated drive-by 'fix'
In theory, the principled solution here is to split the change in two: first change applies sed/ast-grep transformation to get stuff mostly right, producing potentially invalid code. The second change are just more creative fix-ups. This can even be looped: "produce the diff equivalent to the given as a combination of ast-grep + manual touch ups before/after, minimizing the manual touch-ups". Never tried this for real.