LLM APIs are a Synchronization Problem
13 points by jefftriplett
As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template.
I used to believe this too, but it's false. Language models have a sense of anticipation which extends beyond their prediction of the next token; this is used to expect punctuation, estimate line lengths, and probably more. When we inject tokens, we disrupt this anticipation, leaving a distortion in the model state. Quoting Your LLM Knows The Future:
We start from the observation that vanilla, autoregressive, language models contain large amounts of information about future tokens beyond the immediate next token.
This means that the model has a general awareness of when we're doing RAG or other bits of Mad Libs on the prompt. Indeed, the model also has a pre-RLHF notion of the prompt, in the sense that it can tell when generated tokens are conditioned on the model state as opposed to being force-fed by an external driver. As a consequence, if we want a linear, parseable narrative to emerge from the generated tokens, then we must maintain that narrative ourselves through all of the tokens that we choose when constructing the original context.
Aside: previously, on Lobsters, we noted that inference is injective on the model's state. This means that the model also has a sense of when it's been cold-booted, in that model states which are close to zero have similar trajectories over the first few tokens. Empirically, here's an experiment for local model runners: take the vector norm of the model state over the first few tokens, noting that it starts at zero and eventually stabilizes at a steady state. This means that a model statefully distinguishes between cold prompts and warm mid-conversation prompts.
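For concreteness, here's a rough sketch of that experiment. It assumes a recurrent-state runner (rwkv.cpp bindings or similar); `load_model`, `zero_state`, `tokenize`, and `step` are hypothetical stand-ins for whatever single-token interface your runner actually exposes, and the state is assumed to come back as a flat array.

```python
# Hedged sketch of the cold-boot experiment; the model interface here is hypothetical.
import numpy as np

model = load_model("path/to/model")        # hypothetical loader for a local runner
state = model.zero_state()                 # cold boot: the state starts at (near) zero
tokens = model.tokenize("Hello there, this is a warm-up prompt for the experiment.")

for i, tok in enumerate(tokens):
    logits, state = model.step(tok, state)     # feed one token, get the updated state back
    print(i, float(np.linalg.norm(state)))     # norm climbs from ~0, then stabilizes
```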
If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU—mainly the attention key/value cache built from those tokens.
I think that the author doesn't do local models. The conversation history is not maintained in RAM, but is allowed to fade away once the model consumes it; the history is only interesting for humans to read at that point, not for the model's narrative. The working state is small enough that it doesn't have to stay on the GPU and is easy to serialize. The idea is that a conversation is wholly in the past and the only stored state is what the model needs to predict the future.
I used to believe this too, but it's false. Language models have a sense of anticipation which extends beyond their prediction of the next token
When I say this, I'm not claiming "the model is memoryless and only cares about the next token". I'm saying that, architecturally, the only thing the model ever sees is a sequence of tokens. Roles, tools, and other injected material are all just different token patterns plus whatever conventions we layer on top.
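For instance, with the HuggingFace `transformers` library you can watch the role structure collapse into plain tokens (the specific model below is just an example; any model that ships a chat template behaves the same way):

```python
# Roles are rendered into ordinary tokens by the prompt template; the model itself
# only ever consumes the resulting token sequence.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example chat model
messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hello!"},
]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)                  # role markers show up as special text/token patterns
print(tok(rendered).input_ids)   # ...and ultimately it's all just token ids
```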
The paper doesn't contradict that. It shows that the hidden state encodes information about multiple future tokens, which is interesting (and at this point fairly well known; see for instance Karpathy's video explaining how models write poems), but still entirely downstream of "you fed in a prefix of tokens, got a state, and predicted the next one".
This means that the model has a general awareness of when we're doing RAG or other bits of Mad Libs on the prompt. Indeed, the model also has a pre-RLHF notion of the prompt
I don't believe that that follows from the paper. What we can safely say is:
Obviously there are nuances to this, but that general behavior is why I'm talking about synchronization. Going from there to "the model knows you're doing RAG / text injection" or "has a pre-RLHF notion of the prompt" is a big jump, though. From what I can tell, nothing in that paper establishes it.
I think that the author doesn't do local models. The conversation history is not maintained in RAM, but is allowed to fade away once the model consumes it;
I do! In practice I keep both: the human-readable history (for logs, UI, debugging, evals) and some machine-friendly working state. That is the natural way of doing it, and I doubt you will find many people doing it differently. In particular, if you go through LMStudio and Ollama, the interface they give you is, again, a completion-style API.
In general I think you are kind of arguing my point: this is a state problem, and you have to manage and synchronize some notion of conversation state across client and server/GPU. One point Mario brought up is that there is even additional state, which I did not talk about, that often needs to be dealt with: auxiliary files. The point of the blog post was mostly to reframe the discussion as a state synchronization problem and start an interesting conversation about how best to solve it.
Regarding the hidden state, I added a clarification to the post:
The model's internal state encodes strong expectations about how the text will continue. If we insert external text that doesn't match those expectations, we're changing the model's state in a non-trivial way.
Which implies that the model statistically detects when it's no longer inferring tokens. The model's low prior probability for force-fed tokens is precisely a high-perplexity situation. The non-trivial change is precisely that we magnify the otherwise-exponentially-unlikely paths which are associated with each low-probability force-fed token. Injectivity ensures that we cannot splice paths; there isn't a smooth deformation which eases from one context into another, but we must pick how the camera wipes.
I should probably clarify that I'm not anthropomorphizing. When I say "the model has a general awareness of X," what I mean is that there's a total function on the model state and next-token logits which decides whether X occurred. For whether RAG or other templating has occurred, the function is one of the perplexity metrics. For whether the model has recently been cold-booted from a near-zero state, the function is the Euclidean vector norm or something morally equivalent.
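To make that concrete, here's a minimal sketch of those two detector functions. The inputs are hypothetical: it assumes you already have the per-position logits for the force-fed tokens and the raw model state as a flat vector from your runner, and the function names are mine.

```python
import numpy as np

def _softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def injection_score(logits_per_position, forced_token_ids):
    """Mean negative log-probability of the force-fed tokens (a perplexity-style metric).
    High values suggest the tokens were injected rather than sampled from the model."""
    nlls = [-np.log(_softmax(l)[t] + 1e-12)
            for l, t in zip(logits_per_position, forced_token_ids)]
    return float(np.mean(nlls))

def cold_boot_score(state_vector):
    """Euclidean norm of the model state; close to zero right after a cold boot."""
    return float(np.linalg.norm(state_vector))
```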
[Keeping human-readable history] is the natural way of doing it, and I doubt you will find many people doing it differently. In particular, if you go through LMStudio and Ollama, the interface they give you is, again, a completion-style API. In general I think you are kind of arguing my point: this is a state problem, and you have to manage and synchronize some notion of conversation state across client and server/GPU.
Ah! Okay, I think I understand now. So, I agree with your point in the context of how you use LLMs, yes, but two nuances are enough to invalidate it in general. First, I can't really empathize with these sorts of completion-oriented APIs. The harness that I'm currently retiring is a collection of Twisted Python scripts which call rwkv.cpp. It doesn't do completions or interactive prompting; instead, it constantly simulates an environment which isn't guaranteed to produce chatty outputs. There is no history; if I want a human-readable log then I can run the harness as a service and use my operating system's logging. So perhaps I'm not a standard user. Still, yes, because inference is a local event, all of the state must be local to inference; also, because the model detects interruptions and splicing in token streams, the serialized order in which token streams are submitted is significant and amounts to being able to take a lock when doing inference.
But I'm okay with that, because of my second nuance: hosted LLM services are near-useless because they can't possibly be selling what they offer. They claim to offer chatbots, but — as discussed previously, on Lobsters — LLMs can only simulate chatbots. Similarly, they don't offer coding assistance, but a simulacrum of assistance. The thinking is simulated thinking, of the sort that we use to teach students how to think. The conversations we have with such systems are necessarily useless in the real world; at most, they could be used as lipsum, but they are only simulations of conversations worth keeping. This is something I think both you and Mario overlook somewhat; when Mario says, "providers might inject additional tokens, but we don't have access to those anyway, so it doesn't matter;" clearly it does matter because of the order of injections! What's the point of synchronizing an ephemeral local inference?
Which implies that the model statistically detects when it's no longer inferring tokens. The model's low prior probability for force-fed tokens is precisely a high-perplexity situation. The non-trivial change is precisely that we magnify the otherwise-exponentially-unlikely paths which are associated with each low-probability force-fed token.
Potentially, but look at it like this: If you could re-feed exactly the same sequence of tokens the model saw originally (including any hidden/system/reasoning tokens), then recomputing from scratch vs resuming from a saved hidden state should land you in a rather similar internal state (not accounting for randomness).
With regards to the "model knows X happened" point, you're basically saying two things: 1) if perplexity is weird, that means RAG/unexpected insertion happened, and 2) if the hidden-state norm is small, that means we're near a cold boot. From my understanding, high perplexity just means low probability, and that can also happen simply with increased model temperature. It will also happen for a variety of other reasons, for instance because the user made a bunch of spelling mistakes. Neither perplexity nor how warm/cold the hidden state is gets used by the model to influence its output.
That said, I feel like I'm defending something I didn't actually say, because my whole point in the first place was that there is provider-side state worth retaining.
This is something I think both you and Mario overlook somewhat; when Mario says, "providers might inject additional tokens, but we don't have access to those anyway, so it doesn't matter;" clearly it does matter because of the order of injections! What's the point of synchronizing an ephemeral local inference?
There's the theoretical aspect and there's the pragmatic one. Mario is quite pragmatic and said that model providers, given their incentives, are unlikely to change course in any meaningful way. I wasn't taking a position for or against that claim, though I do agree with Mario that it's probably unrealistic to expect major API changes from them that would support the users.
I don't think it changes anything substantial about what I wrote. My point was simply to outline the general idea that this is a state synchronization problem, which I still firmly believe. The rest feels like arguing over details that don't really move my core argument.
At least for the current state of affairs, where most LLMs have <=1M tokens of context, I think the standard chat completions API isn't that bad. It's inefficient in terms of bytes transferred over the network, but that's not what's slow with long context: what's slow is computing the KV cache. In that sense, the "message history" is often actually used as a cache key in the short term — but it's a cache key that also allows you to transparently rebuild the cache from scratch if the cache gets wiped. This is how SGLang and vLLM work, for example.
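As a toy illustration of "history as cache key" (not how SGLang or vLLM actually implement it; they use radix trees and hashed token blocks, but the shape of the idea is the same):

```python
kv_cache = {}   # tuple of prefix token ids -> opaque KV-cache handle

def lookup_kv(history_token_ids):
    """Return (cached KV handle or None, suffix of tokens still to be prefilled)."""
    for cut in range(len(history_token_ids), 0, -1):
        prefix = tuple(history_token_ids[:cut])
        if prefix in kv_cache:
            return kv_cache[prefix], history_token_ids[cut:]
    return None, history_token_ids   # cache wiped: rebuild the whole prefix from the history
```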
A simpler API would be a Responses-style API... But one where the server guarantees it committed the write and will keep it in durable storage. Of course, this would be somewhat expensive, and also have privacy implications: you have to store everyone's data, forever, to have a conformant API. But OTOH, storage is cheap. Maybe it's fine? It would still probably increase API costs slightly though.
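Something like this, as a hedged sketch; `durable_store` and `run_model` are hypothetical, and no real provider API is being described:

```python
import uuid

def create_response(previous_response_id, new_input_tokens):
    """Responses-style call that only returns once the full token state is durably stored."""
    prior = durable_store.get(previous_response_id) if previous_response_id else []
    context = prior + new_input_tokens
    output_tokens = run_model(context)
    response_id = str(uuid.uuid4())
    # Commit before acknowledging: the client can resume from `response_id` later
    # without re-sending the history, because the server has promised to keep it.
    durable_store.put(response_id, context + output_tokens)
    return response_id, output_tokens
```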
I really don't like the existing Responses API, though, since, as the author mentions, there isn't a guarantee that the data will be stored forever, so you end up in this weird state where you're not sure whether you can continue an old thread, because the server may have discarded it.
I'm skeptical of the value of CRDTs here: those make sense when multiple machines can have canonically-accurate but divergent state and need to sync it. With LLM APIs, there's no way the client can generate canonically-accurate responses from Claude/GPT-5.1/etc. without simply querying the remote server. There's no need to "sync": you just need a response from the server.
(author here)
I'm skeptical of the value of CRDTs here
To be clear, I did not suggest CRDTs here. You're right that no client can locally regenerate an LLM answer without calling the provider.
Where I think the analogy still matters is that the model output itself is only one slice of the state, and it's the least interesting one from a systems point of view.
I'm mostly arguing that once you look at the whole thing (client, backend, multiple models, tools, caches, etc.), you're firmly in distributed-state territory. The more general point is that this feels a lot more like a sync/replication problem.