An AI agent coding skeptic tries AI agent coding, in excessive detail
42 points by facundoolano
Minor, but:
Historically, LLMs have been poor at generating Rust code due to its nicheness relative to Python and JavaScript.
I’ve never encountered this issue before. Almost every model that I’ve used has one shotted the Rust code I’ve wanted since very early on in LLM history.
[meta] It's quite entertaining to see these exchanges, and I suspect one of the factors that drives some "trad" developers up the wall (including the fierce faction here on Lobsters) is that the quality of LLM output is contextual. Even with model version and random seed being equal, there are ~infinite factors (user prompt, harness prompt, repo context etc.) that influence output quality.
Most of my attempts at using LLMs with Rust with e.g. Claude 4 were with a project that used the Axum framework. It had serious difficulties producing a handler signature that would compile. Agentic tools sort of paper over this, as they let the model loop over experimenting with the errors and shapes and go back for more information, but that's still a slow process. 4 also had similar difficulties with Kotlin that 4.5 resolved, so I'd guess 4.5 is better at Rust.
If you're talking like copilot or "manually pick the files for the context" iterations of tools (you know, mid 2025 tooling)? Not a hope.
Same with axum, though mostly with changes from axum 0.8.0. It's quick to correct these, though I'd suggest putting updated context from the blog post in your AGENTS.md.
I do remember that last year (probably Claude 4 era) it seemed to have a hard time with the borrow checker (much as I did as a very experienced programmer but Rust beginner): being unable to give an actionable response to borrow checker errors or even consistently explain what they meant, and as a result going in circles trying to fix them, throwing in random lifetimes when the problem was the code was just not structured in a way that would ever work in Rust. With 4.5 and up, this hardly ever happens anymore.
It's even weirder now - on a couple of occasions it seemed not to be writing the code I asked for, and when I dug in with harder prompting it turned out it was impossible due to a lifetime issue, which the model "knew" about and was trying to work around in advance.
I have intentionally made my writing voice more sardonic to specifically fend off AI accusations.
That sounds like something an LLM would say if prompted to sound less like an LLM. /s
"skeptic"? Words mean things. This is critihype.
From the article:
The real annoying thing about Opus 4.6/Codex 5.3 is that it’s impossible to publicly say “Opus 4.5 (and the models that came after it) are an order of magnitude better than coding LLMs released just months before it” without sounding like an AI hype booster clickbaiting, but it’s the counterintuitive truth to my personal frustration.
Contrary to the author's expectation, I think the part of this that makes them sound like an AI hype booster is their completely vibed use of "order of magnitude". I don't think that phrase means what they think it means.
A year ago, I was one of those skeptics who was very suspicious of the agentic hype, but I was willing to change my priors in light of new evidence and experiences, which apparently is rare.
Are you saying the author is just lying then?
Despite researching and developing tooling around LLMs even long before ChatGPT, I haven’t been fond of using LLM code copilots such as GitHub Copilot for coding assistance.
[…]
Claude 3.5 Sonnet has made me rethink things. Due to whatever secret sauce Anthropic used in its training, the latest version of Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) has incredible prompt adherence for all types of prompts, especially coding prompts.
I wouldn't say he’s lying, but „I was a skeptic until now“ seems to be his shtick.
FWIW I think the distinction is between a) 'skepticism' in the sense of, at a particular moment, doubting some particular claim or use and b) 'skepticism' as shorthand for being part of the community that is broadly, enduringly against LLMs and other recent generative stuff.
I'm not saying one definition of the word is more accurate than the other; 'skepticism' is used in the latter way in other fields, not to refer to a specific doubt about a specific thing but to broadly indicate what side someone is on. I'm just saying these are both common ways to use the word, and he's got one in mind and other folks have the other.
Since 2020 he'd been playing with this stuff. In the post you quote he decided he was no longer so doubtful that detailed instructions would make a difference because he'd seen more prompt adherence from newer iterations of the tech. (That doesn't mean he thought they could do, say, the things he does with them in this post; I can tell you they couldn't.) Later that year, after saying that, he said "Vibe coding with coding agents like Claude Code or Cursor is something I have little desire to even experiment with." Then later, after more specific changes named in the post, he said now he does find those tools interesting.
The effort to figure out how far the tech could go since 2020 could disqualify him as "a skeptic" in the sense of being a part of the skeptical community. Disbelief in specific claims about the usefulness of specific tools or techniques at specific points in time also explains saying "I was skeptical that all this work on prompts mattered" or "I was skeptical of the utility of Claude Code/Codex." It is not surprising to me that his opinion of the tech's capabilities changed as new things came out; the capabilities of lots of technologies do change over time.
Of course, understanding the definitions people are using doesn't bring people any closer together on the substance, and some folks will probably say that the post's usage is in some way illegitimate. But descriptively I think that's what is going on!
What do you think skepticism means? My understanding is that being skeptical is about having a negative prior towards new claims, but still updating on new evidence.
Good article; it is another tool worth knowing and, like all the others, in some contexts it helps, in some it does not.