A tale of two Claudes
57 points by steveklabnik
This is part of why having friends is great
This line jumped out at me. I think this highlights one of the (many) indirect hazards of “AI” tool usage. Over time, if it’s “easier” (lower friction and good enough) to rely on a tool instead of a friend, we’ll default to using it more, and weaken our social ties to other humans. It’s fundamentally anti-social.
You’re not wrong on some level. I’m especially worried about this in areas outside of software development.
weaken our social ties to other humans. It’s fundamentally anti-social.
The behavior of the tech community over the last decade has convinced me that a professed interest in social ties and being pro-social is largely lip-service.
Example questions (don’t bother answering, it’s just to give the flavor) to illustrate this:
I’ve watched us dehumanize each other over politics and identity and being insufficiently $X (or too much $X) for pretty much any value of X over the last ten years, and I for one am hard-pressed to believe that we, when faced with the robots, will suddenly find some deep wellspring of compassion and belief in the common value of man.
Well, if you only let me choose between robots, and humans that demonstrably devalue the life of other humans, yes, I might take my chances with the robots.
I agree with your general sentiment, but I think convenience is the bigger factor.
One worry I’ve had is that junior engineers won’t ask for help because they have an LLM there - the result being that there’s less knowledge transfer and we have to hope that the models can effectively teach people.
“Claude Code” isn’t the same thing for everyone; it’s really just a name for an Anthropic product which uses some underlying model to do its work, and that model can change via many different parameters, including just plain ol’ time, but also according to Claude Code settings via e.g. the ANTHROPIC_MODEL env var (with caveats of course). So it’s tricky to draw any kind of conclusion from any particular user experience of Claude Code at any point in time; there are just so many underlying variables that impact those results and are gonna change between users, over time, etc.
This is a great point! I should be including this detail in my posts.
I’m on the $100 Max plan, so I get some amount of Opus 4, and then it falls back to Sonnet 4.
Sure, but if you want to draw conclusions about the quality of responses, you need to control for the specific model and its parameters – not just “Claude Code” or even “Claude Code $100 Max plan” but specifically “Opus 4” or “Sonnet 4”. Hopefully that makes sense.
I get what you’re saying there, but I’m also interested in learning what “Claude Code $100 Max plan defaults” does, too, and I’m unlikely to try for myself real soon, so I’d find a post about that to be quite interesting.
Right, what I meant @peterbourgon is basically what @hoistbypetard is asking here:
The way Max works is like this: your first message to Claude starts a “session.” Sessions are five hours long. You get Opus 4 for the first 20% of your usage limit, and then it swaps to Sonnet 4. After five hours is up, a new session starts. How many messages you get in the five hours depends on context size. From https://support.anthropic.com/en/articles/11014257-about-claude-s-max-plan-usage :
If your conversations are relatively short, with the Max plan at 5x more usage, you can expect to send at least 225 messages every 5 hours
It gives a little indicator in the UI when it swaps from Opus to Sonnet, but you have to, like, pay attention to that and write it down; I would have needed to record it to share the details of how much was being used with which model, which I did not.
The $200/month plan basically 4xes the $100/month plan, to around 900 messages per session.
If you run out of messages, you end up paying for it like the API. I have yet to hit that limit.
One minor correction: I’m 99% sure if you hit the Max limit, you have to opt in to go usage-based. Claude Code will tell you, and makes it easy to jump to the Anthropic web console and set that up. They do make it pretty clear you are switching from a subscription to a usage-based pricing model.
Hilariously enough, I hit the Max limit for the first time last night. It stopped, and told me. In my case, I was about 45 minutes away from the next session, so I just took a break rather than switching, but yes, it was very clear. It told me I was getting close to the limit, and that I had reached it, and didn’t automatically swap to the API billing.
This is true if he wants to make a scientific claim about the underlying models. It is not true if he wants to post about the experience of using Claude Code.
Both are reasonable questions, though it’s worth being clear about which you’re asking.
I think it’s reasonable to leave a comment explaining which model even if it is just an experience report: years from now, which model I’m talking about won’t be obvious.
I think this is a bit overstated because I suspect the vast majority of people using Claude Code right now are using Sonnet 4. Differences in choice of project and in prompting are more significant.
x86 ELF assemblers are something I imagine it has several examples of in its training data. Those examples are exactly what you’re trying to do as opposed to “almost but not quite” as in your Tailwind case.
From that perspective it seems less surprising. These tools are very good at doing things that have been done before, and bad at novelty. Your Tailwind case is novel (to the LLM), even if it’s in a boring way.
Would be interesting to compare how it does on an assembler for a novel format or instruction set if it’s provided with a thorough spec
For sure, that’s kind of why I wrote down my assumptions here: they’re probably wrong, after some reexamination! This is a very common and mature platform, and even if they aren’t in Rust, it probably does have more examples than I would have initially thought.
Would be interesting to compare how it does on an assembler for a novel format or instruction set if it’s provided with a thorough spec
I think it’s unlikely for this to happen for me for this project, but yeah, that would be a better test of its ability to do something more novel.
These tools are very good at doing things that have been done before, and bad at novelty.
Semi-counterpoint: this was back in Sep 2024 (Claude 3.5, back before the more tightly-integrated stuff, but after they added “artifacts” to store blobs of code).
I was already using Gaussian Process Regression to do the kind of forecasting I was asking it about, but I had a fancier idea in mind and I knew that my current toolkit (tinygp) wasn’t up to it. I’d read up on some of the other approaches but they were all pretty daunting, so I threw an exploratory question at Claude. After a little bit we zeroed in on SVGP using GPyTorch and I got it to write me some skeleton code (which was already a big help, because GPyTorch is an incredibly flexible library and although it does have “getting started” resources, none of them were close enough to my needs for me to figure out how to get from point A to point B!)
Through conversation I got it to show me how to use a black-box external program as a “mean function” (the GP’s job is to model the residuals from a “background model”), to learn how to deal with heteroscedastic data (the observations come with a “confidence score” in arbitrary units and it was necessary to subclass Likelihood to add hyperparameters to learn the mapping between confidence score and measurement error, which isn’t known a priori), to enable “gradual forgetting” of old data so that the inducing points are optimized for forecasting, and to constrain the inducing points to the actually meaningful subspace of my input space.
There is a break in the middle where I said “okay, this looks good enough to play with” and hacked around on my own for a few hours replacing sample data with real data and stubs with real implementations, but those were the things I knew how to do. And there were a few cases where Claude made silly mistakes (like forgetting that it was supposed to be writing GPU-optimized code, or forgetting that those variances need to have gradients with respect to the hyperparameters) but it did a surprisingly good job of fixing its oversights once they were pointed out.
In the end I came out with something that really did work with just a little more hacking — it optimized, it trained, it predicted (and, alas, it performed slightly worse than the previous version of my model, but so it goes). In a domain that’s fairly niche to begin with, the end result is a confluence of requirements that I’m sure hasn’t been seen in the world before. I just had to approach it stepwise. But I’m sure I wouldn’t have gotten there at all without the “research help”. That’s when I became convinced that, hey, sometimes this stuff is actually worth my while, although I still don’t use it 97% of the time.
Of course I’ve also seen it just be horribly useless, including a case where I asked it if ClickHouse had a certain capability, it searched the docs for me, and told me: yes, it does that, and the function you want is XYZ, which was an absolute hallucination: it doesn’t do that, and there’s no such function in the docs it cited.
This sounds niche, sure, but assemblers are niche too. That doesn’t say much about whether it’s novel, though.
This is a good point. There’s room for an awful lot of niches in these models. Having unlimited access to a plausible simulacrum of expertise in an unfamiliar domain can be quite valuable, if treated as a learning opportunity.
I like this post because the author seems to like programming instead of describing it as something they want to get away from.
Thanks! This is something I’m going to dedicate a full post to someday. Maybe soon.
I’ve been programming for over 30 years now. I used to find it very intrinsically motivating. Not so much anymore. I do like doing things with code though, and I don’t hate code. So I both understand the people that find this stuff inherently offensive because you’re not writing the code directly as much any more, but also don’t think that vibe coding (by the original definition) is a good way to be a professional.
Still figuring out my own feelings and trying to sort out what I believe and if it’s right.
I had a really good chat with a friend last night about these tools lowering the “activation energy” needed to do certain kinds of tasks. Because sometimes, even if we like to code, some things are boring and tedious. It can make starting those things way easier, and therefore, more likely to happen. I landed a series of diffs in a work codebase yesterday, roughly 2k lines changed, that was a refactor of some error handling I didn’t want to do, but that became easy when I made a robot do it. I of course still had to double check all that work, but I’d been prioritizing other things because it would have taken me a while to do on my own.
I dunno. We’ll see.
lowering the “activation energy” needed
@simonw talks about this as unlocking side quests. I have mixed feelings about it, myself: on the one hand, I definitely see the value of exploratory programming, and some tedious tasks are valuable. On the other hand… this poster, pretty much. Discretion is still gold.
This result, while neat, doesn’t surprise me.
one of the hunches I have about LLM usage for coding is that more popular tools will be easier for LLMs to use.
This runs counter to my understanding of how LLMs work. More training material means a higher probability of the final weights reflecting all possible associations and avoiding a false local minimum. However, popularity brings with it another factor that I think is more powerful: higher edge cardinality in the conceptual association graph that the training data approximates. This means that when the LLM selects from candidates, the weights are more likely to include an unwanted association that would be valid in some objective sense. This is doubly true for generic libraries that fit a lot of use cases.
The thing that makes LLMs particularly good at a task is specificity of associations. There are very few associations with an instruction set other than translation to and from machine code. To some extent translation of IR to machine code as well, but IR tends not to be described in prose the same way assembler is, so it will be underrepresented in the training data. Each instruction has a very specific association. In x86 the instruction variants tend to be associated with specific registers, which are necessarily explicit in the training data. As much as we complain about x86 instruction proliferation, the number of instruction variations is low enough that trial-and-error tools can try each pretty easily.
This is amplified by the effect of survivorship bias and authoring authority (awkward phrasing, sorry). Both x86 and tailwind are long lived projects. x86 is authored by a first party to suit its own ends (at least relative to Tailwind). There was no mov instruction in x86 before they needed it. Newer mov variants could be given new names or new operands. These would be specific to that new purpose. Tailwind is authoring their library and documentation relative to the CSS standard. So while they have the same basic survivorship bias as they author docs and users write code with newer versions of their library, there are a lot of associations for which the CSS-sourced tokens remain the same even if the semantic association is changing.
This runs counter to my understanding of how LLMs work.
One thing I need to spend some time with is understanding how these things actually work. So I appreciate this comment. Lots to think about.
For what it’s worth: having gone through the TailwindCSS v4 upgrade myself, that upgrade is disappointingly fraught with peril. I ran into a few things that simply did not have a transition path. So I’m not the least bit surprised that Claude struggled with that.
(The details: the standalone binary distribution no longer worked, safelist generation had to be swapped out with a statically generated list, config plugins are no longer a thing afaict.)
Ah, thanks. I’m pretty new to Tailwind, so even if I like it, this is my first time upgrading, so I assume that it’s tough because I’m still learning it, not that it’s universally rocky.
It sounds like the rockiness was heavily dependent on which features you were using? It was smooth sailing for me, on a site that’s been around since tailwind 2… going from 3 to 4 was a nothingburger for me.
It’s quite possible. One codebase is only using the Typography plugin, another is using postcss and tailwind-merge. So nothing too intense.
On the site with just Typography, I do have a fairly detailed extended theme (https://github.com/steveklabnik/steveklabnik.com/blob/trunk/tailwind.config.js), and the tailwind upgrade tool didn’t seem to be able to convert it for some reason. I should have taken more notes…
Also having gone through that myself, I’m surprised to hear that. It was almost a no-op for me. I guess I wasn’t using the safelist feature, and I was just using the cli through npm, not the standalone one. My config plugins continued to work until I got updates for v4 css plugins for them. Once I got updated versions of those several weeks later, I moved them to css and deleted my config.js.
The ELF code shouldn’t have an instruction_size method at all, though? The normal setup is that your assembler encodes instructions to bytes, and emits those to a binary blob that you put in an executable section. The act of encoding an instruction gives you the size that it will be, and having both an encoder and a separate thing that tries to reason about how big the instruction will be is bug prone and probably duplicating a bunch of logic. In fact, I wouldn’t expect an ELF library to know anything about the assembler either, and to just have an API to add sections and segments with file and virtual offsets.
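To make that shape concrete, here’s a minimal Rust sketch with hypothetical names (ElfWriter, Insn, encode, and assemble are mine, not from the post’s code): the encoder returns the bytes, so the size falls out of encoding rather than a separate instruction_size, and the ELF writer only deals in sections.

struct ElfWriter {
    sections: Vec<(String, Vec<u8>, u64)>, // (name, data, virtual address)
}

impl ElfWriter {
    // The ELF side only knows about sections, not instructions.
    fn add_section(&mut self, name: &str, data: Vec<u8>, vaddr: u64) {
        self.sections.push((name.to_string(), data, vaddr));
    }
}

enum Insn { Nop, Ret }

// Encoding is the single source of truth for size: it's just encoded.len().
fn encode(insn: &Insn) -> Vec<u8> {
    match insn {
        Insn::Nop => vec![0x90],
        Insn::Ret => vec![0xC3],
    }
}

fn assemble(insns: &[Insn]) -> Vec<u8> {
    let mut code = Vec::new();
    for i in insns {
        code.extend(encode(i)); // no separate size bookkeeping needed
    }
    code
}

Once assemble returns the blob, the writer would just get something like add_section(".text", blob, vaddr).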
First of all, I’m not claiming it is good code. I barely gave Claude any guidance, I just wanted to see what it would do.
It’s a little more OO-style than I personally would write, for example.
The normal setup is that your assembler encodes instructions to bytes, and emits those to a binary blob that you put in an executable section.
Sure. Here’s the code:
// Convert instructions to machine code
pub fn assemble(&mut self, instructions: Vec<Instruction>) -> Result<Vec<u8>, CodegenError> {
    // First pass: calculate symbol addresses
    let mut address = 0u64;
    for instr in &instructions {
        match instr {
            Instruction::Label(name) => {
                self.symbol_table.insert(name.clone(), address);
            }
            _ => {
                address += self.instruction_size(instr);
            }
        }
    }

    // Second pass: emit machine code
    for instr in &instructions {
        self.emit_instruction(instr)?;
    }

    // Resolve relocations
    self.resolve_relocations()?;

    Ok(self.code.clone())
}
It’s doing two passes, not one (and then one pass over the relocations that are emitted by emit_instruction). That’s fine, for my purposes, right now. Moving it to a single pass seems like good future work.
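For what it’s worth, the usual single-pass shape is to emit a placeholder for forward label references and backpatch once the label’s address is known. A rough sketch, with hypothetical names that don’t match the post’s code:

use std::collections::HashMap;

struct Asm {
    code: Vec<u8>,
    labels: HashMap<String, u64>, // label -> offset in `code`
    fixups: Vec<(usize, String)>, // (position of placeholder, label it refers to)
}

impl Asm {
    fn define_label(&mut self, name: &str) {
        self.labels.insert(name.to_string(), self.code.len() as u64);
    }

    // Emit the rel32 operand of e.g. a jmp/call; patched later if the label
    // isn't defined yet.
    fn emit_label_ref(&mut self, name: &str) {
        self.fixups.push((self.code.len(), name.to_string()));
        self.code.extend_from_slice(&[0; 4]); // placeholder
    }

    fn finish(mut self) -> Result<Vec<u8>, String> {
        for (pos, name) in &self.fixups {
            let target = *self
                .labels
                .get(name)
                .ok_or_else(|| format!("undefined label {name}"))? as i64;
            // rel32 is relative to the end of the 4-byte field
            let rel = (target - (*pos as i64 + 4)) as i32;
            self.code[*pos..*pos + 4].copy_from_slice(&rel.to_le_bytes());
        }
        Ok(self.code)
    }
}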
probably duplicating a bunch of logic.
It’s just a big old match statement returning integers, no logic. Unless you count the duplicate match arms as “logic”. This would be resolved on its own by moving to a single pass, which is good. I’m not likely to mess up a case because the match needs to be exhaustive in both cases anyway.
In fact, I wouldn’t expect an ELF library
Sure, all of this stuff would probably be better, and I’ll probably evolve it into something closer than that in the future. This is like 800 lines of code, it is nowhere near complete or good.
Those instruction size calculations look horrifying. PUSH can have anywhere from 1 to 10 bytes, even excluding useless prefixes, depending on what is being pushed. Most x86 instructions are like this.
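To make that concrete with a hedged sketch (the Operand enum and encode_push here are hypothetical, not from the post’s code), even a handful of PUSH forms encode to different lengths:

enum Operand {
    Reg(u8),  // 0..=15, e.g. 0 = RAX, 8 = R8
    Imm8(i8),
    Imm32(i32),
}

fn encode_push(op: &Operand, out: &mut Vec<u8>) {
    match op {
        // 1 byte: 0x50+r for rax..rdi
        Operand::Reg(r) if *r < 8 => out.push(0x50 + r),
        // 2 bytes: REX.B prefix needed for r8..r15
        Operand::Reg(r) => out.extend_from_slice(&[0x41, 0x50 + (r - 8)]),
        // 2 bytes: push imm8 (sign-extended)
        Operand::Imm8(v) => out.extend_from_slice(&[0x6A, *v as u8]),
        // 5 bytes: push imm32
        Operand::Imm32(v) => {
            out.push(0x68);
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
}

Memory operands (push qword ptr [...]) add a ModRM byte, an optional SIB byte, and up to a 4-byte displacement on top of that, which is where the longer encodings come from.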
I have been wondering if we will see more stable APIs in the future because of LLMs.
Not just this, but I’ve seen people speculate (and some also say that they do this) that they tend to stick to more mature, slower-moving libraries because of this. There’s some amount of danger of this creating a winner-take-all scenario, but at the same time, it’s not like this isn’t already true without LLMs. Which also doesn’t mean further cementing this effect is necessarily good… but also, maybe there has been too much churn, and therefore busy work, and so maybe (the collective) we need some chilling out.
I think LLMs just make what was always a good choice an even more obvious choice. There is real cost in churn, and LLMs just bring a new visibility to that. If that is the path, then I think it’s a net positive one.
It also could hinder experimentation by guiding even more people towards a more common/generic way of solving problems. Some of the greatest breakthroughs in any field come from someone(s) going against the common approach/solution.
There is a difference between experimentation and constant API churn. I would not worry about true innovation all that much.
I think you’re dismissing the possibility too quickly, and I don’t think you’re giving any reason.
I hope that real innovation will be easily distinguishable from API churn. I worry that it won’t be.
If tooling makes experimentation more difficult, it’s very plausible that genuinely good ideas will struggle to find adoption. It only makes sense that good ideas will win out if they’re so obviously compelling that users opt for them despite bad AI support.
Raising the floor doesn’t necessarily lower the ceiling, though. Maybe more people will be frustrated with the more easily accessible baseline and have the energy (although maybe not the domain expertise!) to try going further.
I haven’t tried it out yet, but it really seems like a missing feature that Claude Code doesn’t automatically save the prompts/transcripts of what you do with it, especially when compared to e.g. llm’s logging functionality.
Yes, I have heard that maybe there are some undocumented features here, but they should be more prominent and documented if true.
I also am a bit unsure about the interface for continuing a convo vs starting a new one; context management matters a ton, but it’s also easy to accidentally forget to pass --continue.