The Feature That Has Never Worked · A broken auto-live poller, and what perceived urgency does to Claude Code
21 points by cmeiklejohn
It sounds like the LLM is operating exactly like a bad junior developer.
A bad junior developer with short-term memory issues. Which, to me, is the most exhausting part of working with LLMs. The moment you slip up, which is really easy if you are not careful, they start introducing these subtle but very breaking bugs.
The article from this post puts it even better than I did in my linked comment.
To be clear I am not anti-AI. These tools are the equivalent of high end power tools for a master tradesperson. They have slashed the time it takes to solve niche problems, automated the tedious boilerplate required for compilation, and can replicate a human defined pattern across a codebase in seconds. Used correctly, they are a massive win for the individual engineer.
However, we are biologically wired for the path of least resistance. Evolution taught us that cognitive energy is a scarce resource. If we can avoid thinking deeply about a complex architectural trade off by pressing a “magic solve it for me” button, we usually will.
The danger is that today’s models are just good enough to hide the rot. In a corporate environment driven by quarterly goals, a massive pile of technical debt scheduled for next year is often ignored in favor of the “velocity” shown on a dashboard today.
Consistently running into these issues, introduced by models that have no reliable way to learn from mistakes and retain memories other than us asking them to pretty please read markdown files, has made me question a lot about the industry and my peers as a whole. To the point where I am wondering about the quality (or lack thereof) of software written by those leaning so heavily on these workflows with very little human oversight.
I don't know if your product is in production with real users, but I'd describe this workflow as ad hoc testing in the live environment. How about simulating that by replaying a timeline of input events? This number of regressions and incomplete solutions could be addressed by a mindful human programmer, but as complexity goes up there's a breaking point that way, too.
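The replay idea above can be sketched as a pure transition function folded over a recorded event log, so the live show can be re-run offline against the current code. This is a minimal illustration only; the event and state types are hypothetical, not from the author's codebase.

```rust
// Minimal sketch of replay testing: all types and names are
// hypothetical stand-ins for whatever the live poller consumes.

#[derive(Debug, Clone)]
enum Event {
    TrackStarted { id: u32 },
    TrackEnded { id: u32 },
}

#[derive(Debug, Default, PartialEq)]
struct State {
    now_playing: Option<u32>,
}

// Pure transition function: same events in, same state out,
// so a recorded timeline reproduces a live-show bug deterministically.
fn apply(mut state: State, event: &Event) -> State {
    match event {
        Event::TrackStarted { id } => state.now_playing = Some(*id),
        Event::TrackEnded { id } => {
            if state.now_playing == Some(*id) {
                state.now_playing = None;
            }
        }
    }
    state
}

// Replay a recorded timeline against the current code.
fn replay(events: &[Event]) -> State {
    events.iter().fold(State::default(), apply)
}

fn main() {
    let timeline = vec![
        Event::TrackStarted { id: 1 },
        Event::TrackEnded { id: 1 },
        Event::TrackStarted { id: 2 },
    ];
    assert_eq!(replay(&timeline).now_playing, Some(2));
    println!("replay ok");
}
```

Because `apply` is pure, any regression found in production becomes a few lines appended to a fixture file rather than another round of ad hoc poking at the live system.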
This might be my favorite post on the subject of using LLMs everywhere for coding. It shows just how risky it is to let them build production systems. Yes, it built your cross-platform app with the backend etc. But I do not want to be on call for something that ignores fundamentals of DB design, changes prod, and at no point thinks about fault handling.
If there is one thing I love about Rust, it is that it holds me accountable for error states. There is a `Result`? Sure, I could unwrap it, but that's most likely a code smell. What if that connection failed? Do I really want to ignore a DB failure? Maybe I will just use a default propagation method, but that is a conscious decision about how I handle this failure, and it makes me think for a moment about what it would mean to fail at this point.
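To make the point concrete, here is a minimal sketch of that forced decision. The `parse_port` function is a made-up example, not from any real codebase; the same shape applies to a DB connection returning `Result`.

```rust
use std::num::ParseIntError;

// `?` is the "default propagation method": on failure it returns the
// error to the caller instead of crashing like `.unwrap()` would.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    let port: u16 = raw.trim().parse()?; // propagate, don't ignore
    Ok(port)
}

fn main() {
    // The caller is now forced to decide what failure means here.
    match parse_port("8080") {
        Ok(p) => println!("listening on {p}"),
        Err(e) => eprintln!("bad port: {e}"),
    }
    assert!(parse_port("not a port").is_err());
}
```

The compiler will not let you silently drop the `Err` case, which is exactly the moment of reflection the comment above describes.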
The agent had this rule in its memory. It had been told this rule multiple times. When I asked why it did it anyway, it explicitly said it prioritized urgency and getting me an immediate result
Excellent rationalisation, but how do we know it is true? LLMs have extremely little meta-cognition, and even less meta-meta-cognition (the ability to recognise they have poor meta-cognition). We should not trust anything one says about its state of mind or past reasoning: it does not have memory of that, and will almost certainly just make something up. I try to make it a rule never to ask one a question whose answer I could not reasonably verify. https://lukeplant.me.uk/blog/posts/chatgpt-no-inner-monologue-or-meta-cognition/
“Mitigation was words in a database, not code.” The agent’s instinct when asked to prevent a class of failure was to write down a reminder to be more careful. That’s not a mitigation. That’s a New Year’s resolution. A mitigation is a pre-commit hook that blocks the merge. A mitigation is a test that fails when the query returns zero rows. A mitigation is a script that runs automatically and catches the error before a human ever sees it.
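A "test that fails when the query returns zero rows" can be as small as the sketch below. The function name and row shape are hypothetical; the point is that the check runs in code, automatically, rather than living as a reminder in a memory file.

```rust
// Hypothetical guard: a mitigation expressed as code that trips,
// not as a note promising to be more careful next time.
fn require_nonempty(rows: &[(u32, String)], query_name: &str) -> Result<(), String> {
    if rows.is_empty() {
        // Fail loudly before a human ever sees a blank screen.
        Err(format!("mitigation tripped: `{query_name}` returned zero rows"))
    } else {
        Ok(())
    }
}

fn main() {
    let rows = vec![(1, "upcoming show".to_string())];
    assert!(require_nonempty(&rows, "upcoming_shows").is_ok());
    assert!(require_nonempty(&[], "upcoming_shows").is_err());
    println!("guard ok");
}
```

Wire a check like this into CI or a pre-commit hook and the agent's "I'll remember next time" becomes irrelevant: the merge simply blocks.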
We've known the truth of this since long before LLMs. It's less of a problem with human programmers because we do actually learn from our mistakes, and because fighting with programmatic enforcement is frustrating when most developers (especially solo ones) would be both writing the enforcement and subject to it. The interesting part to me is that as LLMs and the tooling / harnesses around them have gotten better, we've leaned more into mitigating failure states by just sticking another line in claude.md or the like. We know this doesn't work; it's never worked. We now have the tools the author mentions to authoritatively verify agent actions. It could certainly be a lot better, but it's still miles better than a few tokens prepended to the prompt so the agent says "i pinky pwomise not to mess it up again"...
Second interesting thing that stuck out to me:
never tell the AI something is broken during a live event. File a bug. Fix it tomorrow. The live show is not the time to ship code, and the AI cannot be trusted to maintain process discipline when it perceives urgency.
This is one of many cases I've seen recently where less context actually produces a much better LLM output. I remember when the first "chain of thought" LLMs were being released, particularly OpenAI's o1, and they THRIVED on context. The more detailed of a prompt and the more background information you fed them, the better the result.
In tools like claude code, context isn't in short supply, and interleaved tool calling seems pretty effective at fetching more context when the model thinks it needs it. However, I think the newer frontier models' behavior is too affected by the context in which the task is presented to them. What I mean is that in the case the author mentions, a sense of urgency made claude skip tests and ignore its own rules before pushing to prod. A human should know that rushing an urgent task would probably result in sloppy work, but I think these frontier LLMs are so fine-tuned for tone-matching that the extra context hurts their output.
When I read the post I assumed that "on nugs" was some new-to-me gen-z slang.
But I googled it and it turns out "nugs" is a music streaming site. TIL