AI Coding Assistants Are Getting Worse

35 points by eduard


satvikberi

This is a really weird test – he forbids the model from providing commentary, but wants it to state that the task is impossible.

I'm much more inclined to trust tests like METR's evaluations – they all have flaws, but METR's basic test seems much more systematic and sane to me.

Anecdotally, I've definitely noticed models asking more questions in the last few months, as well as pointing out that requirements are unclear or contradictory. As far as I can tell, they're getting significantly better overall.

n1000

This matches my own experience. I appreciate the attempt at precision and would like to see other quantitative investigations.

Newer models confabulate readily and convincingly. They're tuned to be part of some so-called "agentic loop," and part of that is minimizing the rate of giving up or producing explicit nonsense. It keeps the feedback loop running smoothly, but it degrades trust that the final result is what you want.

The verbose newer models are also worse at reviewing code. It's the same thing. GPT-4 et al. will give you some meager output and leave the lazy dev to look at the rest. GPT-5 and Claude 4.5 will be thorough, with bullet points for everything you wanted (and a dozen things you didn't), but you can't trust 'em. Now the lazy dev thinks he's done the work.

It's not just coding, either. Magic: The Gathering, for example, is a game where you can easily come up with novel questions that have objective answers and require deductive reasoning. Kind of like mathematics, but you can ask an easyish question that isn't in the training data. Claude Sonnet 4.5 will always respond "That's actually really clever!" and give some plausible bullshit.

Task accuracy going from 80% to 90% doesn't matter if the mistakes are harder to spot.
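
To make that concrete with made-up numbers (nothing here comes from any benchmark; it's only the arithmetic), here's a quick sketch of why higher raw accuracy can still mean more shipped mistakes once errors get harder to catch:

    # Hypothetical numbers, purely illustrative.
    def shipped_errors(tasks: int, accuracy: float, catch_rate: float) -> float:
        """Expected number of mistakes that survive human review."""
        errors = tasks * (1 - accuracy)     # mistakes the model makes
        return errors * (1 - catch_rate)    # mistakes the reviewer misses

    # Older model: 80% accurate, but its failures are obvious and easy to spot.
    old = shipped_errors(tasks=100, accuracy=0.80, catch_rate=0.90)  # -> 2.0

    # Newer model: 90% accurate, but confident, plausible failures slip past review.
    new = shipped_errors(tasks=100, accuracy=0.90, catch_rate=0.50)  # -> 5.0

    print(old, new)  # 2.0 5.0: higher accuracy, more undetected mistakes

The catch rate does all the work in that comparison, and that's exactly what confident, verbose output erodes.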