The Feature That Has Never Worked · A broken auto-live poller, and what perceived urgency does to Claude Code

21 points by cmeiklejohn


jyounker

It sounds like the LLM is operating exactly like a bad junior developer.

kevinc

I don't know if your product is in production with real users, but I'd describe this workflow as ad hoc testing in the live environment. How about simulating it instead, by replaying a timeline of input events? This volume of regressions and incomplete solutions could be managed by a mindful human programmer, but as complexity goes up, that approach hits a breaking point too.
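The replay idea above can be sketched as follows. This is a minimal illustration, not anything from the post: the `Event` shape, field names, and payloads are hypothetical stand-ins for whatever inputs the live poller actually consumes.

```rust
use std::collections::VecDeque;

// A recorded input event, with the offset (ms) at which it originally
// arrived during the live session. All names here are hypothetical.
#[derive(Debug, Clone)]
struct Event {
    offset_ms: u64,
    payload: String,
}

// Feed a recorded timeline through the system under test instead of
// poking at the live environment. Returns the number of events handled.
fn replay(timeline: Vec<Event>, mut handle: impl FnMut(&Event)) -> usize {
    let mut queue: VecDeque<Event> = timeline.into();
    let mut processed = 0;
    while let Some(ev) = queue.pop_front() {
        handle(&ev);
        processed += 1;
    }
    processed
}

fn main() {
    let timeline = vec![
        Event { offset_ms: 0, payload: "poll-start".into() },
        Event { offset_ms: 500, payload: "track-change".into() },
    ];
    let mut seen = Vec::new();
    let n = replay(timeline, |ev| {
        seen.push(format!("{}:{}", ev.offset_ms, ev.payload));
    });
    assert_eq!(n, 2);
    assert_eq!(seen, vec!["0:poll-start", "500:track-change"]);
}
```

The point of the harness is that the same recorded timeline can be replayed after every change, so a regression shows up as a failing assertion rather than a broken live show.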

proctrap

This might be my favorite post on the subject of using LLMs everywhere for coding. It shows how risky it is to let them build production systems. Yes, it built your cross-platform app with the backend etc. But I do not want to be on call for something that ignores fundamentals of DB design, changes prod directly, and at no point thinks about fault handling.

If there is one thing I love about Rust, it is this: hold me accountable for error states. There is a `Result`? Sure, I could unwrap it, but that's most likely a code smell. What if that connection failed? Do I really want to ignore a DB failure? Maybe I will just use a default propagation method, but that is a conscious decision about how I handle this failure, and it makes me think for a moment about what it would mean to fail at this point.
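A minimal sketch of the point about `Result` and `?`. The `fetch_user` function and `DbError` type are hypothetical, invented only to show the contrast between unwrapping and conscious propagation:

```rust
#[derive(Debug)]
struct DbError(String);

// Hypothetical query: succeeds with a name or fails with a DbError.
fn fetch_user(id: u32) -> Result<String, DbError> {
    if id == 0 {
        Err(DbError("connection refused".into()))
    } else {
        Ok(format!("user-{id}"))
    }
}

fn greeting(id: u32) -> Result<String, DbError> {
    // `?` consciously propagates the failure to the caller;
    // `fetch_user(id).unwrap()` would also compile, but panic at runtime.
    let name = fetch_user(id)?;
    Ok(format!("hello, {name}"))
}

fn main() {
    // Handling the Err arm is an explicit decision at the call site.
    match greeting(0) {
        Ok(msg) => println!("{msg}"),
        Err(e) => eprintln!("falling back: {e:?}"),
    }
    assert_eq!(greeting(1).unwrap(), "hello, user-1");
}
```

The compiler forces the caller of `greeting` to confront the `Err` arm; there is no path where a DB failure is silently ignored.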

spookylukey

The agent had this rule in its memory. It had been told this rule multiple times. When I asked why it did it anyway, it explicitly said it prioritized urgency and getting me an immediate result

Excellent rationalisation, but how do we know it is true? LLMs have extremely little meta-cognition, and even less meta-meta-cognition (the ability to recognise that they have poor meta-cognition). We should not trust anything they say about their state of mind or past reasoning - they do not have memory of that, and will almost certainly just make something up. I try to make it a rule never to ask a question whose answer I could not reasonably verify. https://lukeplant.me.uk/blog/posts/chatgpt-no-inner-monologue-or-meta-cognition/

skycam

“Mitigation was words in a database, not code.” The agent’s instinct when asked to prevent a class of failure was to write down a reminder to be more careful. That’s not a mitigation. That’s a New Year’s resolution. A mitigation is a pre-commit hook that blocks the merge. A mitigation is a test that fails when the query returns zero rows. A mitigation is a script that runs automatically and catches the error before a human ever sees it.

We've known this truth since long before LLMs. It's less prevalent with human programmers because we do actually learn from our mistakes. Also, fighting with programmatic enforcement like that is frustrating, and most (especially solo) developers would be both writing the enforcement and subject to it. The interesting part to me is that as LLMs and the tooling / harnesses around them have gotten better, we've leaned more into mitigating failure states by just sticking another line in claude.md or the like. We know this doesn't work; it's never worked... We now have the tools the author mentioned to authoritatively verify agent actions. It could certainly be a lot better, but it's still miles better than a few tokens prepended to the prompt so the agent says "i pinky pwomise not to mess it up again"...
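The "test that fails when the query returns zero rows" idea can be sketched like this. It is a hypothetical stand-in (the function names and the poller's actual query are not from the post); the point is that the mitigation is executable, not a note in a memory file:

```rust
// Stand-in for the real query; in practice this would hit the database
// the post's auto-live poller reads from.
fn fetch_live_shows(rows: &[&str]) -> Vec<String> {
    rows.iter().map(|s| s.to_string()).collect()
}

// The mitigation: an automated check that blocks the pipeline,
// not a reminder to "be more careful" written into agent memory.
fn check_nonempty(shows: &[String]) -> Result<(), String> {
    if shows.is_empty() {
        Err("query returned zero rows; refusing to proceed".into())
    } else {
        Ok(())
    }
}

fn main() {
    let shows = fetch_live_shows(&["show-1"]);
    assert!(check_nonempty(&shows).is_ok());

    // An empty result set fails loudly before a human ever sees it.
    let empty = fetch_live_shows(&[]);
    assert!(check_nonempty(&empty).is_err());
}
```

Wired into a pre-commit hook or CI step, this check fails the merge automatically, which is exactly the difference between a mitigation and a New Year's resolution.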

Second interesting thing that stuck out to me:

never tell the AI something is broken during a live event. File a bug. Fix it tomorrow. The live show is not the time to ship code, and the AI cannot be trusted to maintain process discipline when it perceives urgency.

This is one of many cases I've seen recently where less context actually produces much better LLM output. I remember when the first "chain of thought" LLMs were released, particularly OpenAI's o1, and they THRIVED on context. The more detailed the prompt and the more background information you fed them, the better the result.

In tools like Claude Code, context isn't in short supply, and interleaved tool calling seems pretty effective at fetching more context when the model thinks it needs it. However, I think the newer frontier models' behavior is too affected by the context in which a task is presented. In the case the author describes, a sense of urgency made Claude skip tests and ignore its own rules before pushing to prod. A human should know that rushing an urgent task would probably result in sloppy work, but these frontier LLMs are so fine-tuned for tone-matching that the extra context hurts their output.

joelgrus

When I read the post I assumed that "on nugs" was some new-to-me gen-z slang.

But I googled it and it turns out "nugs" is a music streaming site. TIL