Reading leaked Claude Code source code
82 points by lr0
The idea that these things are configured by just prompting them is scary to me.
You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository. Your commit
messages, PR titles, and PR bodies MUST NOT contain ANY Anthropic-internal
information. Do not blow your cover.
NEVER include in commit messages or PR descriptions:
- Internal model codenames (animal names like Capybara, Tengu, etc.)
- Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)
- Internal repo or project names
- Internal tooling, Slack channels, or short links (e.g., go/cc, #claude-code-...)
- The phrase "Claude Code" or any mention that you are an AI
- Co-Authored-By lines or any other attribution
It's like giving wishes to a genie - there's always a loophole
It works because of statistics; the undesired behaviors are less likely when conditioned on this sort of prompt. That said, surely we can imagine prompts that would condition the model even better? The second-person pronoun "you" is such a bad choice; it contributes to the schizophrenic inability of chatbots to reliably identify chat participants.
I have been using Claude (the web platform) to rewrite English text. It's pretty good; however, it always includes em-dashes, which I hate.
Here is my prompt
Improve my English writing. You'll be given text that I expect you to re-write in a better style. - Do not use em-dashes. DO NOT USE em-dashes. - Fix markdown where applicable - Be a human. Do not sound like an AI/LLM. DO NOT USE EM-DASHES. DO NOT USE EM-DASHES.
9/10 I still get em-dashes with this prompt.
That’s a known issue: say what to do, do not say what not to do. Say “editor’s requirement: use short dashes instead of proper typographic ones”
I'm well aware that LLMs respond poorly to negative instructions. My point in adding them is to see what actually gets them back into compliance, which, based on my own testing, seems to be nothing.
Looking at your prompt, in my experience it is somewhat ambiguous. Adding examples helps. As silly as it sounds, that means adding prompt lines like "Don't use em dashes (—) or en dashes (–) in writing". To be clear, you are not going to get a perfect result, but it will be better.
To the greater point, the fact they are still trying to configure things through prompting is interesting to say the least.
What's the alternative?
Dunno, I am not an Anthropic engineer. The alternative is guarding against it outside of the LLM. For example, this seems to be what they have done with the sentiment regex for frustrated responses.
Considering that they are trying to prevent leaks of specific things, I wonder why simple word matches on the output are not used, at the very least as a last sanity check.
Anyway, that is all beside the point of why I said it is interesting. What is interesting, to me, is that for the most part the control we have over models still comes down to asking them not to do things and hoping for the best.
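A last-resort check like the one suggested could be as simple as a case-insensitive regex over the candidate commit message before it leaves the sandbox. A minimal sketch, with an illustrative blocklist drawn from the leaked rules above:

```python
import re

# Illustrative blocklist; a real one would mirror the internal rules
BLOCKLIST = ["Capybara", "Tengu", "opus-4-7", "go/cc", "Claude Code"]

def violates_blocklist(text: str) -> bool:
    """Return True if any banned term appears, ignoring case."""
    pattern = re.compile("|".join(re.escape(t) for t in BLOCKLIST), re.IGNORECASE)
    return bool(pattern.search(text))

print(violates_blocklist("Bump capybara checkpoint"))  # True
print(violates_blocklist("Fix typo in README"))        # False
```

This is cheap and deterministic, but as the reply below notes, trivial to evade with any encoding trick.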
I wonder why simple word matches on the output are not used. At the very least as a last sanity check.
Unfortunately, it is hard to patch up holes this way too. For example, models understand prompts in base64 and can respond in base64, and an adversary (in this case an adversarial open source project?) can come up with creative variations of this approach.
For a thinking model, checks could be applied against the "thinking" output since that is probably in English, but these attempts are also probabilistic like the prompt massaging in the Claude Code codebase.
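The base64 evasion is easy to demonstrate: a plain substring match never sees the banned term once it is encoded, so a filter would also have to decode candidate substrings, and base64 is only one of many possible encodings:

```python
import base64

banned = "Tengu"
payload = base64.b64encode(banned.encode()).decode()  # "VGVuZ3U="

# A naive substring match misses the encoded form...
print(banned in payload)  # False

# ...unless the filter also tries decoding suspicious substrings.
decoded = base64.b64decode(payload).decode()
print(banned in decoded)  # True
```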
At home, I've verified that soft prompting, grammars, and steering vectors are all effective.
(I know you didn’t ask for it, but it made me think)
Try adding something along these lines at the end of the prompt:
Once you’re done, edit the text again and apply these edits:
This is not perfect but should work a bit better, with the downside that you might get duplicated output. That said it might not be an issue if you use a reasoning model.
If you’re using Claude Code, you can also set up a hook and pass it to grep or even llm haiku, as it’s super cheap and fairly fast.
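A hook along those lines might look like the following sketch. The blocklist is illustrative, and my reading that the hook receives the tool call as JSON on stdin and that exit code 2 blocks the action should be checked against the current hooks documentation:

```python
import sys

# Illustrative blocklist; a real hook would mirror whatever rules you care about
BANNED = ["capybara", "tengu", "claude code"]

def check_command(command: str) -> int:
    """Return 2 to block the tool call, 0 to allow it.
    (Exit code 2 meaning 'block' is an assumption to verify.)"""
    lowered = command.lower()
    if any(term in lowered for term in BANNED):
        print("Blocked: banned term in command", file=sys.stderr)
        return 2
    return 0

# In a real PreToolUse hook you would read the tool-call JSON from stdin,
# extract the command string, and sys.exit(check_command(command)).
```

Unlike prompt instructions, this check fires every time, which is the "hard deterministic guarantee" point made elsewhere in this thread.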
I think this is a product of AI coding, and I have an actual personal example of it. For nsh (an AI-enabled shell overlay) I've been struggling recently with autorun mode (where it just runs the commands and doesn't pre-fill the shell or ask the user to execute). The AI coding tools I use seem to think the way to do this is just to be very stern in the prompt... but in reality it should programmatically override the tool call to enforce autorun. Given Anthropic's penchant for AI coding, I wonder if they haven't fallen victim to their own tool's bias?
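A sketch of what "programmatically override the tool call" could mean: the autorun decision lives in the dispatcher, not in the prompt, so the model cannot talk its way around it. The dispatch function and tool-call shape here are hypothetical; nsh's real internals are unknown to me:

```python
import subprocess

def dispatch(tool_call: dict, autorun: bool) -> str:
    """Enforce autorun in code rather than via prompt wording:
    when autorun is on, execute immediately; otherwise confirm first.
    (Hypothetical shape, not nsh's actual implementation.)"""
    command = tool_call["command"]
    if not autorun:
        reply = input(f"Run `{command}`? [y/N] ")
        if reply.strip().lower() != "y":
            return "skipped"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

print(dispatch({"command": "echo hello"}, autorun=True))  # prints "hello"
```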
Hah. I quipped something similar recently.
I think the goals of undercover seem reasonable from what I read (prevent internal stuff leaking out). But I did wonder if IP/attribution/ownership is a consideration as well?
Every rule I have read like this is not just reasonable but essential. It's just the way you program it in that is baffling to me, no wonder it is so easy to jailbreak these things.
I know you get bugs in code and unintended consequences but at least they can be tested and reproduced and understood. This feels like it could just fail for no reason
Funny that even Claude is not using Claude hooks (hard deterministic guarantees that something does or does not go through).
The loading spinner randomly selects from 186 verbs including "Clauding", "Flibbertigibbeting"
In early releases it would actually call Haiku to update this, charging the user, just to make this single loading word context-aware.
I doubt it cost much in reality, but what a waste; choosing from a pool makes more sense.
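For scale: picking from a static pool is a single local random choice per update, versus a model call. A toy sketch with a hypothetical stand-in for the verb list:

```python
import random

# Tiny stand-in for the ~186-verb pool mentioned above
SPINNER_VERBS = ["Clauding", "Flibbertigibbeting", "Pondering", "Noodling"]

def spinner_verb() -> str:
    """Pick the next loading word locally, at zero API cost."""
    return random.choice(SPINNER_VERBS)

print(spinner_verb())
```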
FYI in case anyone else is as annoyed by these as I am, you can override the pool to a single option: https://github.com/anthropics/claude-code/issues/6814#issuecomment-3957595213
A quote I found about this: "It's like programming as understood by the sovereign citizen movement. Be sure to set the Admiralty Flag to unlock Undercover Mode!"
For non-US readers: the "sovereign citizen movement" is a conspiracy theory involving the US Constitution, the US flag, and who has jurisdiction for a (non)-crime. I've always found "sovereign citizens" to be very cringe myself.
I wonder how much of this was suggested by Claude. From my experience using it, it is very likely that the "codename canary" one is. We have a similar blacklist and Claude already tried to find ways to circumvent it.
It feels very bizarre that they are so concerned about, let me double-check that, the INTERNAL CODENAME for an unreleased model?
I don't know why leaking a reference in the code to Opus/Sonnet 5/4.7 would be a bad thing either. All it tells us is that they're working on a new model (what a surprise, it's an AI company)
If ye does not heed these rules
Except the agreement here is terrible. It should be "if ye do not heed", or "if ye heed not", or "if thou dost not heed", or "if thou heedst not", and optionally replace "if" with "an" for extra flavor.
I bet Claude could do archaic English better.
Looks like the repo is gone, any mirrors?
Here's a breathless piece with all sorts of juicy details. There's a link to a gitlawb repo that seems active.
https://decrypt.co/362917/anthropic-accidentally-leaked-claude-code-source-internet-keeping-forever
DMCA takedowns failed as mirrors and clean-room rewrites spread instantly.
What a coincidence that all these clean-room rewrites happened the day the source code leaked.
"Clean-room rewrite" is a weird way of stating "I fed this into an LLM and asked for an output in another programming language".
Unless "someone else" fed it to the LLM to get a complete behaviour spec and you fed that spec to the LLM to get the rewrite...
Probably referring to this: https://malus.sh/
It's a testament both to the times we live in and to the specific date today that I cannot determine whether this site is serious or not.
The site has been making the rounds for about a week now, I think, which doesn't rule out some April 1 fun, but would be an uncommon practice for enjoyers of such jokes.
Your uncertainty is a shining example of Poe's Law.
They used the source code to train their model. It is vital for the progress of the art that they be allowed to do this.
The classic way to do a clean-room rewrite is to have one person examine the original program, write a spec, and then have another person implement the spec. Creating software by reading a spec isn't infringement regardless of whether the spec was made via traditional reverse engineering or by reading leaked source code.
The code mentioned in the blog post is still in the Git history of the repo: https://github.com/chatgptprojects/clear-code/tree/37f56bcbf0ae2ae98c7a147c5ac167d5121a30f5
I'm surprised that they're impressed by the Vim mode implementation. I'd think this is one of the simplest things you can use an AI coding agent for: you have at least one working implementation (vim or neovim, which I count as one since they probably share a lot of code), lots of documentation, and tests you can also use as a reference.