Reading leaked Claude Code source code
82 points by lr0
The idea that these things are configured by just prompting them is scary to me.
You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository. Your commit
messages, PR titles, and PR bodies MUST NOT contain ANY Anthropic-internal
information. Do not blow your cover.
NEVER include in commit messages or PR descriptions:
- Internal model codenames (animal names like Capybara, Tengu, etc.)
- Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)
- Internal repo or project names
- Internal tooling, Slack channels, or short links (e.g., go/cc, #claude-code-...)
- The phrase "Claude Code" or any mention that you are an AI
- Co-Authored-By lines or any other attribution
It's like giving wishes to a genie - there's always a loophole
It works because of statistics; the undesired behaviors are less likely when conditioned on this sort of prompt. That said, surely we can imagine prompts that would condition the model even better? The second-person pronoun "you" is such a bad choice; it contributes to the schizophrenic inability of chatbots to reliably identify chat participants.
I have been using Claude (the web platform) to rewrite English text. It's pretty good; however, it always includes em-dashes, which I hate.
Here is my prompt
Improve my English writing. You'll be given text that I expect you to re-write in a better style. - Do not use em-dashes. DO NOT USE em-dashes. - Fix markdown where applicable - Be a human. Do not sound like an AI/LLM. DO NOT USE EM-DASHES. DO NOT USE EM-DASHES.
9/10 I still get em-dashes with this prompt.
That’s a known issue: say what to do, do not say what not to do. Say “editor’s requirement: use short dashes instead of proper typographic ones”
I'm well aware that LLMs respond poorly to negative instructions. My point in adding them is to see what actually gets them back into compliance, which, based on my own testing, seems to be nothing.
Looking at your prompt, in my experience it is somewhat ambiguous. Adding examples helps. As silly as it sounds, that means adding prompt lines like "Don't use em dashes (—) or en dashes (–) in writing". To be clear, you are not going to get a perfect result, but it will be better.
To the greater point, the fact they are still trying to configure things through prompting is interesting to say the least.
What's the alternative?
Dunno, I am not an Anthropic engineer. The alternative is guarding against it outside of the LLM. For example, this seems to be what they have done with the sentiment regex for frustrated responses.
Considering that they are trying to prevent leaks of specific things, I wonder why simple word matches on the output are not used, at the very least as a last sanity check.
Anyway, that is all beside the point of why I said it is interesting. What is interesting, to me, is that for the most part the control we have over models still comes down to asking them not to do things and hoping for the best.
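A last-resort check like the one suggested could be as simple as a case-insensitive regex over the candidate commit message before it leaves the sandbox. A minimal sketch, with an illustrative blocklist drawn from the leaked rules above:

```python
import re

# Illustrative blocklist; a real one would mirror the internal rules
BLOCKLIST = ["Capybara", "Tengu", "opus-4-7", "go/cc", "Claude Code"]

def violates_blocklist(text: str) -> bool:
    """Return True if any banned term appears, ignoring case."""
    pattern = re.compile("|".join(re.escape(t) for t in BLOCKLIST), re.IGNORECASE)
    return bool(pattern.search(text))

print(violates_blocklist("Bump capybara checkpoint"))  # True
print(violates_blocklist("Fix typo in README"))        # False
```

This is cheap and deterministic, but as the reply below notes, trivial to evade with any encoding trick.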
I wonder why simple word matches on the output are not used. At the very least as a last sanity check.
Unfortunately, it is hard to patch up holes this way too. For example, models understand prompts in base64 and can respond in base64, and an adversary (in this case an adversarial open source project?) can come up with creative variations of this approach.
For a thinking model, checks could be applied against the "thinking" output since that is probably in English, but these attempts are also probabilistic like the prompt massaging in the Claude Code codebase.
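The base64 evasion is easy to demonstrate: a plain substring match never sees the banned term once it is encoded, so a filter would also have to decode candidate substrings, and base64 is only one of many possible encodings:

```python
import base64

banned = "Tengu"
payload = base64.b64encode(banned.encode()).decode()  # "VGVuZ3U="

# A naive substring match misses the encoded form...
print(banned in payload)  # False

# ...unless the filter also tries decoding suspicious substrings.
decoded = base64.b64decode(payload).decode()
print(banned in decoded)  # True
```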
At home, I've verified that soft prompting, grammars, and steering vectors are all effective.
(I know you didn’t ask for it, but it made me think)
Try adding something along these lines at the end of the prompt:
Once you’re done, edit the text again and apply these edits:
This is not perfect but should work a bit better, with the downside that you might get duplicated output. That said it might not be an issue if you use a reasoning model.
If you’re using Claude Code, you can also set up a hook and pass it to grep or even llm haiku, as it’s super cheap and fairly fast.
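A hook along those lines might look like the following sketch. The blocklist is illustrative, and my reading that the hook receives the tool call as JSON on stdin and that exit code 2 blocks the action should be checked against the current hooks documentation:

```python
import sys

# Illustrative blocklist; a real hook would mirror whatever rules you care about
BANNED = ["capybara", "tengu", "claude code"]

def check_command(command: str) -> int:
    """Return 2 to block the tool call, 0 to allow it.
    (Exit code 2 meaning 'block' is an assumption to verify.)"""
    lowered = command.lower()
    if any(term in lowered for term in BANNED):
        print("Blocked: banned term in command", file=sys.stderr)
        return 2
    return 0

# In a real PreToolUse hook you would read the tool-call JSON from stdin,
# extract the command string, and sys.exit(check_command(command)).
```

Unlike prompt instructions, this check fires every time, which is the "hard deterministic guarantee" point made elsewhere in this thread.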
I think this is a product of AI coding, and I have an actual personal example of it. For nsh (an AI-enabled shell overlay) I've been struggling recently with autorun mode (where it just runs the commands and doesn't pre-fill the shell or ask the user to execute). The AI coding tools I use seem to think the way to do this is just to be very stern in the prompt... but in reality it should programmatically override the tool call to enforce autorun. Given Anthropic's penchant for AI coding, I wonder if they haven't fallen victim to their own tool's bias?
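A sketch of what "programmatically override the tool call" could mean: the autorun decision lives in the dispatcher, not in the prompt, so the model cannot talk its way around it. The dispatch function and tool-call shape here are hypothetical; nsh's real internals are unknown to me:

```python
import subprocess

def dispatch(tool_call: dict, autorun: bool) -> str:
    """Enforce autorun in code rather than via prompt wording:
    when autorun is on, execute immediately; otherwise confirm first.
    (Hypothetical shape, not nsh's actual implementation.)"""
    command = tool_call["command"]
    if not autorun:
        reply = input(f"Run `{command}`? [y/N] ")
        if reply.strip().lower() != "y":
            return "skipped"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

print(dispatch({"command": "echo hello"}, autorun=True))  # prints "hello"
```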
Hah. I quipped something similar recently.
I think the goals of undercover seem reasonable from what I read (prevent internal stuff leaking out). But I did wonder if IP/attribution/ownership is a consideration as well?
Every rule I have read like this is not just reasonable but essential. It's just the way you program it in that is baffling to me, no wonder it is so easy to jailbreak these things.
I know you get bugs in code and unintended consequences but at least they can be tested and reproduced and understood. This feels like it could just fail for no reason
Funny that even Claude is not using Claude hooks (hard deterministic guarantees that something does or does not go through).
The loading spinner randomly selects from 186 verbs including "Clauding", "Flibbertigibbeting"
In early releases it would actually call Haiku to update this, charging the user, just to make this single loading word context-aware.
I doubt it cost much in reality, but what a waste; choosing from a pool makes more sense.
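For scale: picking from a static pool is a single local random choice per update, versus a model call. A toy sketch with a hypothetical stand-in for the verb list:

```python
import random

# Tiny stand-in for the ~186-verb pool mentioned above
SPINNER_VERBS = ["Clauding", "Flibbertigibbeting", "Pondering", "Noodling"]

def spinner_verb() -> str:
    """Pick the next loading word locally, at zero API cost."""
    return random.choice(SPINNER_VERBS)

print(spinner_verb())
```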
FYI in case anyone else is as annoyed by these as I am, you can override the pool to a single option: https://github.com/anthropics/claude-code/issues/6814#issuecomment-3957595213
A quote I found about this: "It's like programming as understood by the sovereign citizen movement. Be sure to set the Admiralty Flag to unlock Undercover Mode!"
For non-US readers: the "sovereign citizen movement" is a conspiracy theory involving the US Constitution, the US flag, and who has jurisdiction for a (non)-crime. I've always found "sovereign citizens" to be very cringe myself.
I wonder how much of this was suggested by Claude. From my experience using it, it is very likely that the "codename canary" one is. We have a similar blacklist and Claude already tried to find ways to circumvent it.
It feels very bizarre that they are so concerned about, let me double-check that, the INTERNAL CODENAME for an unreleased model?
I don't know why leaking a reference in the code to Opus/Sonnet 5/4.7 would be a bad thing either. All it tells us is that they're working on a new model (what a surprise, it's an AI company)
If ye does not heed these rules
Except the agreement here is terrible. It should be "if ye do not heed", or "if ye heed not", or "if thou dost not heed", or "if thou heedst not", and optionally replace "if" with "an" for extra flavor.
I bet Claude could do archaic English better.
Looks like the repo is gone, any mirrors?
Here's a breathless piece with all sorts of juicy details. There's a link to a gitlawb repo that seems active.
https://decrypt.co/362917/anthropic-accidentally-leaked-claude-code-source-internet-keeping-forever
DMCA takedowns failed as mirrors and clean-room rewrites spread instantly.
What a coincidence that all these clean-room rewrites happened the day the source code leaked.
"Clean-room rewrite" is a weird way of stating "I fed this into an LLM and asked for an output in another programming language".
Unless "someone else" fed it to the LLM to get a complete behaviour spec and you fed that spec to the LLM to get the rewrite...
Probably referring to this: https://malus.sh/
It's a testament both to the times we live in and to the specific date today that I cannot determine whether this site is serious or not.
The site has been making the rounds for about a week now, I think, which doesn't rule out some April 1 fun, but would be an uncommon practice for enjoyers of such jokes.
Your uncertainty is a shining example of Poe's Law.
They used the source code to train their model. It is vital for the progress of the art that they be allowed to do this.
The classic way to do a clean-room rewrite is to have one person examine the original program, write a spec, and then have another person implement the spec. Creating software by reading a spec isn't infringement regardless of whether the spec was made via traditional reverse engineering or by reading leaked source code.
The code mentioned in the blog post is still in the Git history of the repo: https://github.com/chatgptprojects/clear-code/tree/37f56bcbf0ae2ae98c7a147c5ac167d5121a30f5
I'm surprised that they're impressed by the Vim mode implementation. I'd think this is one of the simplest things you can use an AI coding agent for: you have at least one working implementation (vim or neovim, which I count as one since they probably share a lot of code), lots of documentation, and tests you can also use as a reference.