LLM APIs are a Synchronization Problem

13 points by jefftriplett


Corbin

As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template.
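For concreteness, here's roughly what that injection looks like with a HuggingFace-style chat template. This is just a sketch, not code from the article; Qwen/Qwen2.5-0.5B-Instruct is only an example of an instruct model that ships a template.

```python
# Sketch: role markup is nothing more than extra tokens spliced into the stream.
from transformers import AutoTokenizer

# Example model; any instruct model that ships a chat template works the same way.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hello."},
]

# Render the conversation as one flat string: the roles become special tokens
# and boilerplate text, nothing the core model treats as magic.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# ...and as far as the model is concerned, it is all just token ids.
print(tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))
```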

I used to believe that framing too, but it's false. Language models have a sense of anticipation that extends beyond their prediction of the next token; it is used to expect upcoming punctuation, estimate line lengths, and probably more. When we inject tokens, we disrupt that anticipation, leaving a distortion in the model state. Quoting "Your LLM Knows The Future":

We start from the observation that vanilla, autoregressive, language models contain large amounts of information about future tokens beyond the immediate next token.

This means that the model has a general awareness of when we're doing RAG or other bits of Mad Libs on the prompt. Indeed, the model also has a pre-RLHF notion of the prompt, in the sense that it can tell when generated tokens are conditioned on the model state as opposed to being force-fed by an external driver. As a consequence, if we want a linear, parseable narrative to emerge from the generated tokens, then we must maintain that narrative ourselves in every token we choose when constructing the original context.

Aside: previously, on Lobsters, we noted that inference is injective on the model's state. This means that the model also has a sense of when it's been cold-booted, in that model states which are close to zero have similar trajectories over the first few tokens. Empirically, here's an experiment for local model runners: take the vector norm of the model state over the first few tokens, and note that it starts at zero and eventually stabilizes at a steady magnitude. In other words, a model statefully distinguishes between cold prompts and warm mid-conversation prompts.
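A minimal sketch of that experiment, assuming a HuggingFace-style local runner; I'm taking the final-layer hidden state as a stand-in for "the model state", and gpt2 is just an example model.

```python
# Sketch: track the norm of the hidden state over the first few tokens of a
# cold prompt, to check the claim that it starts small and then stabilizes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example; any local causal LM will do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states[-1] is the final-layer state, shaped (batch, seq_len, hidden_dim);
# take one norm per token position, from the cold start onward.
norms = out.hidden_states[-1][0].norm(dim=-1)
for pos, n in enumerate(norms.tolist()):
    print(f"token {pos:2d}: |state| = {n:8.2f}")
```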

If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU—mainly the attention key/value cache built from those tokens.

I think the author doesn't run local models. The conversation history is not maintained in RAM, but is allowed to fade away once the model consumes it; at that point the history is only interesting for humans to read, not for the model's narrative. The working state is small enough that it doesn't have to stay on the GPU, and it is easy to serialize. The idea is that the conversation is wholly in the past, and the only stored state is whatever the model needs to predict the future.
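As a sketch of that last point (my code, not the article's, again assuming a HuggingFace-style runner with gpt2 standing in for whatever runs locally): consume the prompt once, serialize the derived working state, throw the history away, and resume from the state alone.

```python
# Sketch: serialize the working state (the KV cache) and resume from it later,
# without keeping the consumed conversation tokens around.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example stand-in for a real local model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = tok("The conversation so far, consumed exactly once.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(prompt, use_cache=True)

# The per-layer key/value cache is the only state the model needs to go on.
# (Newer transformers wrap it in a Cache object, hence the conversion.)
pkv = out.past_key_values
if hasattr(pkv, "to_legacy_cache"):
    pkv = pkv.to_legacy_cache()
torch.save(pkv, "working_state.pt")

# Later, possibly in another process: restore the state and keep predicting.
restored = torch.load("working_state.pt")
nxt = tok(" What happens next", return_tensors="pt").input_ids
with torch.no_grad():
    out2 = model(nxt, past_key_values=restored, use_cache=True)
print(tok.decode([out2.logits[0, -1].argmax().item()]))
```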