You don't want long-lived keys
11 points by vrolfs
Just in the last few weeks we've shipped an MCP tool. According to the spec, MCP uses OAuth, and implementers are encouraged to use refresh_tokens and short-lived credentials. That failed straight away – it turns out none of the larger code assistants (Claude Code, Cursor, etc.) handle refresh_token AT ALL. Users keep getting logged out and have to redo the auth/consent flow from scratch.
So, against my will and better judgement, we had to ship long lived tokens :(
Not a bad commentary at all, but the implicit understanding underpinning this is that you actually know the full lifecycle of all of your cryptographic material. My usual interaction with cryptosystems is less on the internals and more on the selection & compliance side; most people get the lifecycle wrong somewhere in their constructions.
So, if I were to add a prerequisite to the blogpost: please understand all the places that you create, use, and dispose of keys.
The article seems to mix keys and tokens. I'm working on exactly this problem – using signed tokens – at the moment, for a new infra build. I've come to the conclusion that it will be very difficult to avoid long-lived tokens for services, so instead I've created a simple revocation and logging mechanism; in particular, unused tokens just stop working after 45 days.
The problem I’ve got with key/token rotation for services is that it adds a lot of complexity, and in the end you still need to rely on some signing mechanism, which for regular automated rotation becomes just another attack surface in the stack.
So instead, we have a policy of one token per service, token creation is audited, tokens have limited scope, use is logged, they are revocable, they can get stale - and we no longer need automated signing.
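That policy (audited creation, logged use, revocation, staleness after 45 days) can be sketched in a few lines. This is a minimal in-memory illustration of the idea, not the commenter's actual implementation – all names and the store itself are made up, and a real system would back this with a database or KV store:

```python
import time

STALE_AFTER = 45 * 24 * 3600  # unused tokens stop working after 45 days

tokens = {}  # token_id -> record; stand-in for a persistent store

def issue(token_id, service, scopes):
    """Token creation is audited and the token has a limited scope."""
    tokens[token_id] = {"service": service, "scopes": set(scopes),
                        "revoked": False, "last_used": time.time()}
    print(f"AUDIT issue {token_id} for {service} scopes={sorted(scopes)}")

def revoke(token_id):
    tokens[token_id]["revoked"] = True

def check(token_id, scope, now=None):
    """Every use is checked (revocation, staleness, scope) and logged."""
    now = now or time.time()
    record = tokens.get(token_id)
    if record is None or record["revoked"]:
        return False
    if now - record["last_used"] > STALE_AFTER:
        return False  # stale: unused for 45 days, so it just stops working
    if scope not in record["scopes"]:
        return False
    record["last_used"] = now  # use is logged, which also resets staleness
    print(f"AUDIT use {token_id} scope={scope}")
    return True
```

The point of the sketch: no signing infrastructure is involved on the hot path – just a lookup plus a timestamp update.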
I’m not an expert in security so maybe someone can tell me what I’m doing wrong. But as I get older, I feel that a lot of advice about security is just moving pieces around the board.
For example, AFAIK the refresh tokens mentioned in these comments provide no more security than an access token if revocation checks are performed at time of use. The refresh token is just a proxy for a long-lived access token.
While the advice to exercise the key rotation muscles is sound, I always struggle to reconcile this advice with the level of complexity it creates, and I really hate complexity.
[100% not any kind of expert - please take with a pinch of salt]
I think the point of the access + refresh pair is that you make your access tokens short-lived enough that you don't need revocation checks on each use (just rely on the expiry) and you shift those revocation checks to the refresh token - which for a big service could be a huge saving if, e.g., an access token gets used 1000x an hour and your refresh token is only used 1x an hour.
Yes, you now have a blast radius of up to [expiry] for a compromised token but on the other hand, your database[0] isn't melting with revocation checks.
[0] Yes, I know Redis exists, the point stands.
Oh yes, I totally understand that, but in reality many systems do revocation checks on the access tokens anyway (ORY for example does this, IIRC). In any case, my principle is that every second we know a token is compromised and don't revoke it reflects poorly on me and my team.
Anyway - where I landed was an allowlist for token IDs which is stored in a NATS KV. The KV is updated with context whenever it's used. It's plenty performant enough*, and we only need to do the check on long-lived keys, since short-lived keys by definition have other revocation mechanisms. And NATS KVs have TTLs, which gives us token expiration for free.
This eliminates the client side complexity of refresh tokens, with all the benefits I mentioned previously. And we can see which tokens are in use and when they were last used.
* we’re talking 10k+ token checks/second
> many systems do revocation checks on the access tokens anyway
Then they don't need the refresh tokens, really.
> we’re talking 10k+ token checks/second
Sure but at FAANG scale, anything you can shave off the response time (even if it's just a handful of ms), internal traffic and storage (even if it's just a handful of bytes) is going to add up fast.
(I wouldn't bother with access+refresh/OAuth myself but apart from a short stint at Yahoo!, I've thankfully stayed away from anything high traffic.)