The Claude Coding Vibes Are Getting Worse
46 points by ciferkey
I think part of the frustrating loop with these is that
Does anyone have a link to that continuously running benchmark checking for daily quality regressions? It's annoyingly hard to find now.
Edit: had to use Claude to find it https://marginlab.ai/trackers/claude-code/
This morning I had the brand new Claude Opus 4.7 and Qwen3.6-35B-A3B - a 21GB model file running directly on my laptop - draw me pelicans and I liked the Qwen one better.
Tried the qwen 3.6 q4_m but it was completely unable to write me a simple shader after 4 tries, something that old Sonnet versions did without making a mistake.
I tried the local version of Qwen3 Coder recently. It was surprisingly capable, but I still noticed a major difference between its capabilities and Kimi’s. Also, at one point Qwen broke the build, claimed that the build was broken before it started, and then when I told it to fix the build it said this was “out of scope for the requirements it was given” and refused to do so. So that was fun.
I’ve found qwen better than kimi. I do wonder why we have opposite experiences. What agent do you use?
I was using the local quantized model, not the full cloud-hosted one, which was probably a factor. I’ve also only been doing this for a very short amount of time—I’m not particularly excited about AI coding from a philosophical perspective, but it’s something I’m going to have to be comfortable with for an upcoming job. So I could totally be holding it from the wrong end. I’m using OpenCode and greatly dislike it.
QOTD from Hyperpape (elsewhere on the interwebs)
The metaphor I subscribe to with LLMs is that they’re like a talisman that accentuates your inbuilt tendencies.
To the extent that you might tend to jump to conclusions or be sloppy, they exacerbate that. To the extent that you are careful, they can be a tool for doing more.
Most of us are neither intellectual saints nor pure slop producers, so we have to be very careful.
"The proprietary tool that is trained using a poorly understood random process with unknown data and evaluated on pie-in-the-sky benchmarks is not behaving as I've come to intuitively expect."
What a shock it must be!
I think GIGO is alive and well as is Conway's Law: a program reflects the engineering culture that developed it.
https://techtrenches.dev/p/the-snake-that-ate-itself-what-claude
Adaptive thinking is the only thinking-on mode, and in our internal evaluations it reliably outperforms extended thinking.
Boris literally went to HN and advised people to turn off adaptive thinking because it was buggy to the point of allocating zero thinking tokens to important things.
Is one now expected to follow what a techfluencer said in an HN thread just to run software reliably?
He's not an influencer, he's the guy responsible for Claude Code. I see what you mean though. HN being the tech world's unofficial support page is a somewhat perverse outcome but at least it allows people to engage with insiders. Complaining loudly enough until it reaches the HN frontpage and some employee sees it seems to be extremely effective. For some companies like Google that seems to be the only way to get support.
About the issue I was referring to: there were issues with adaptive thinking that would cause it to allocate literally zero thinking tokens sometimes, leading to stupidity.
The data points at adaptive thinking under-allocating reasoning on certain turns — the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. we're investigating with the model team. interim workaround:
CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn.
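For anyone wanting to try it, the quoted workaround would presumably be applied like this — a minimal sketch assuming the `claude` CLI reads the environment variable named in the comment above:

```shell
# Workaround from the quoted comment: force a fixed reasoning budget
# instead of letting the model allocate thinking tokens per-turn.
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1

# Sessions started from this shell inherit the setting.
claude
```

Setting it via `export` affects only the current shell; to make it persistent you'd add the line to your shell profile.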
And now, less than two weeks after that incident, they release a model where it's not possible to disable adaptive thinking!