LLM-assisted coding is not deterministic. Does it matter?
6 points by vrypan
This roughly aligns with my experience both writing and analyzing code. Objective criteria (e.g., tests) can act as a stopgap against slop and hallucinations, letting only valuable code through. It also helps focus human effort on what is truly valuable.
Physics isn't deterministic. You were closest when you mentioned that chaotic systems diverge at an exponential rate from initial conditions; autoregressive language models diverge at a rate that is exponential per token. There is a barrier for each of the systems you've listed beyond which our predictions degrade into noise regardless of the quality of the underlying model: solar-system orbits are only predictable for about ten megayears, weather patterns for about two weeks, and dice (vigorously shaken in a cup) for about ten seconds.
How fast do LLMs diverge? That would be a very interesting question to study! Previously, on Lobsters, we noted that LLMs are (among other properties) sensitive to variance in initial conditions, so there ought to be an empirically-measurable Lyapunov exponent of some sort, but I can't find any papers which give precise numerical estimates. Personal experiments suggest divergence is at its worst after as few as ten thousand generated tokens on a 3.7B-parameter model. Of course, force-feeding tokens from any sort of steering, harness, or conversation will more-or-less reset the state to new initial conditions, so this is a surmountable architectural obstacle.
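For what it's worth, here's the shape of the measurement I have in mind, as a minimal sketch; the token lists are made up, and in practice they'd be two samples from the same model on the same (or one-token-perturbed) prompt:

```python
from itertools import zip_longest

def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two generations disagree,
    or None if they never do."""
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None

def disagreement_rate(tokens_a, tokens_b):
    """Fraction of positions (up to the shorter length) that differ."""
    n = min(len(tokens_a), len(tokens_b))
    return sum(a != b for a, b in zip(tokens_a, tokens_b)) / n if n else 0.0

# Made-up stand-ins for two sampled continuations of the same prompt:
run_a = "the cat sat on the mat and purred".split()
run_b = "the cat sat on the rug and purred".split()
print(first_divergence(run_a, run_b))   # 5
print(disagreement_rate(run_a, run_b))  # 0.125
```

Plotting first_divergence against perturbation size over many samples is the kind of thing that would put a number on the exponent.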
I agree. But the point is that neither humans nor LLMs are deterministic, so why is this an argument against LLMs?
What we really care about is predictability. For example, knowing that an agent tends to diverge after 10k tokens is good for predictability: it gives a bound within which one can feel safe using it, and lets you develop strategies to deal with the divergence. Similar to managing a dev team and knowing that after 20 hours of straight coding, their output will diverge from their normal quality. :-)
When you use a typical autocomplete in an IDE, write three letters and you'll see a list of the available functions that start with those letters. Write them again and you'll get the same list (maybe reordered by some LRU scheme or similar). When you use an LLM to help you complete, you might get the function you needed based on the context, or a Python script that downloads a library to analyze your code and prints a list of cities that share the same letters. Or maybe a foreach loop, when you wanted to write 'fortune'.
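To make the contrast concrete, the classic completer is a pure function of its inputs, which is the whole point; a minimal sketch:

```python
def autocomplete(prefix, symbols):
    """IDE-style completion: a pure function, so the same prefix and the
    same symbol table produce the same list, every single time."""
    return sorted(s for s in symbols if s.startswith(prefix))

symbols = {"foreach", "format", "fortune", "forward", "filter"}
print(autocomplete("for", symbols))  # ['foreach', 'format', 'fortune', 'forward']
```

There is no sampling step anywhere for randomness to creep in.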
Monte Carlo would be a bad implementation for autocomplete. Does that make Monte Carlo a bad tool? Is genetic programming useless?
Different tools are useful for different things and it takes skill to wield a powerful non-deterministic tool appropriately. Part of holding LLMs correctly is not picking them when you want simple deterministic results like autocomplete.
I think nondeterminism is a good argument for refusing to use an LLM for the same tasks as other software. The reasons to refuse to use an LLM for the same tasks as humans are more like lack of accountability or lack of mental processes crucial to some particular task or ethical concerns about the impact of replacing a class of human laborers. There can be many criticisms of how people use the technology, without all of them applying to all cases.
I was focused on programming, where until now there was no other software that could write it.
The post is not an argument why we should or should not let LLMs write code, but that the argument "I don't want LLMs to write code because they are not deterministic" is weak because a) the alternative, people, are also non-deterministic, and b) it's not what matters. There are good arguments for and against human and LLM coders depending on the case, determinism is not one of them imo.
Oh, I was replying to the article without replying directly to the point. I would suggest that LLMs are bad at writing, and we should reject bad writing regardless of whether it's predictably bad, deterministically bad, generated by chatbots, etc. I think that your point is mostly unsupported by the evidence presented; once your first table is undermined, it's hard to justify your second table. For example (and expect this to get a bit meta): what's your source for the following claims?
> Weather is a good example. The laws of physics governing the atmosphere have not changed, and they are deterministic. Yet our ability to predict the weather has improved over decades simply because our measurements, models, and computing power improved.
The standard understanding is that the laws of physics aren't sufficient to predict weather, requiring differential equations which estimate it thermodynamically. These equations aren't deterministic, and the underlying physics isn't deterministic either. Our ability to predict the weather is based on the 1960s paradigm of numerical prediction with an ensemble of initial conditions which mitigate chaos. While computing power improved, it was mostly spent on fine-grained measurements, allowing weather predictions which detail individual hours rather than individual days. There's a mathematical reason for this, quoting Wikipedia:
> A more fundamental problem lies in the chaotic nature of the partial differential equations that describe the atmosphere. It is impossible to solve these equations exactly, and small errors grow with time (doubling about every five days). Present understanding is that this chaotic behavior limits accurate forecasts to about 14 days even with accurate input data and a flawless model.
"doubling about every five days" is an indicator of the numerical value of the Lyapunov exponent.
Weather was used as an example of how "predictability often depends on our capabilities". We are better at predicting weather today than we were 100 years ago, not because the laws that govern it changed, but because we got better at it.
Weather is such a complex phenomenon that it can be studied at all levels, from micro to macro, so maybe it was not an ideal example. But do you disagree that determinism and predictability are two different things?
Sure! But LLMs are classically computed and chaotic, so they are deterministic and not predictable, like the pixels of visualizations of the Mandelbrot set. This is why I'm asking about the provenance of your claims; whoever told you that weather is deterministic was misleading you and may have misled you in other ways.
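The Mandelbrot comparison is easy to make concrete; here's a minimal escape-time sketch, completely deterministic, where points a millionth apart behave completely differently:

```python
def escape_time(c, max_iter=1000):
    """Iterate z -> z*z + c and count steps until |z| exceeds 2.
    Fully deterministic: the same c always gives the same count."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter

print(escape_time(-2.0 + 0j))       # 1000: c = -2 is in the set, never escapes
print(escape_time(-2.000001 + 0j))  # 0: a millionth away, escapes immediately
```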
Sure!
This is the important point.
The post was written for people who think that determinism guarantees predictability. It is not intended to exhaust the concepts or stand as an academic paper. When you use real-world examples to discuss determinism, you are eventually going to commit fouls, or get into deep discussions about the nature of physics and the nature of the universe.
I think there are two concepts that are worth distinguishing:

1. Indeterminism: the underlying laws are themselves stochastic, so even perfect knowledge of the initial conditions does not fix the outcome.
2. Deterministic chaos: the laws are deterministic, but sensitivity to initial conditions makes long-range prediction impossible in practice.
Weather forecasting sits squarely in the second category — the Navier–Stokes equations are deterministic PDEs, and prediction horizons come from chaos, not from any indeterminism in atmospheric physics.
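A one-line dynamical system makes the distinction vivid; this logistic-map sketch is fully deterministic, yet two starting points differing in the sixth decimal place end up in completely different states:

```python
def logistic(x, r=4.0, steps=30):
    """Iterate the deterministic logistic map x -> r*x*(1-x)."""
    out = [x]
    for _ in range(steps):
        x = r * x * (1.0 - x)
        out.append(x)
    return out

a = logistic(0.200000)
b = logistic(0.200001)  # initial condition differs by 1e-6
for n in (0, 10, 20, 30):
    print(n, abs(a[n] - b[n]))
# The gap grows roughly exponentially, from 1e-6 to order 1 within
# a couple dozen steps: same law, no randomness, no predictability.
```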
Anyways, I'm not sure that any of what I said applies to the discussion of LLMs, but the physicist in me felt compelled to address the analogy.
Worth checking out the concept of computational irreducibility, which is deterministic and non-predictable but not chaotic, if you're into these things.
One thing that I don't understand is why LLMs are not deterministic.

Take an LLM that has been trained on a data set: if I ask it a question, it will reply differently every time. Surely if the data set is fixed, the answer should be the same every time? Does anyone know why?
The first part of this is that they're made to be non-deterministic: that's the temperature / top-p / top-k parameters. You want a little bit of randomness so the output isn't just the single most likely thing, but a probabilistic mix of the most likely tokens. This produces more creative output, hence the oft-quoted advice to use temperature > 1 for creative writing.
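A minimal sketch of what those knobs do, with toy logits rather than any particular model's real implementation (greedy decoding is the temperature-to-zero limit of this):

```python
import math, random

def sample_next(logits, temperature=1.0, top_k=None):
    """Toy sampler: scale logits by 1/temperature, optionally keep only
    the top-k candidates, then draw from the resulting softmax."""
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]          # top-k: drop the unlikely tail
    scaled = [(tok, lg / temperature) for tok, lg in items]
    m = max(lg for _, lg in scaled)    # subtract max for numerical stability
    weights = [math.exp(lg - m) for _, lg in scaled]
    r = random.random() * sum(weights)
    for (tok, _), w in zip(scaled, weights):
        r -= w
        if r <= 0:
            return tok
    return scaled[-1][0]

logits = {"foreach": 3.0, "fortune": 1.8, "format": 0.5}  # made-up numbers
print([sample_next(logits, temperature=0.2) for _ in range(5)])  # almost always 'foreach'
print([sample_next(logits, temperature=1.5) for _ in range(5)])  # noticeably more varied
```

Low temperature sharpens the distribution toward the top token; high temperature flattens it, which is exactly the creative-writing knob.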
Even if you set temperature to 0 (not all models let you), you get a mostly deterministic result, but not 100%. We set temperature to 0 at work and I've seen this in practice. I'm not an ML researcher, so my understanding of this is fuzzy, but what I do know is that it comes from floating-point arithmetic: addition isn't associative, and the order of operations on the underlying hardware can vary between runs, creating subtle shifts that compound over time.
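The floating-point half of that is easy to demonstrate in pure Python (the GPU half, that parallel reductions don't fix a summation order, is my general understanding rather than a claim about any specific hardware):

```python
# Floating-point addition is not associative: grouping changes the result.
print((1e16 - 1e16) + 1.0)  # 1.0
print((1e16 + 1.0) - 1e16)  # 0.0 -- the 1.0 is absorbed by the big number

# A parallel reduction sums thousands of terms in whatever order the
# hardware schedules them, so logits can differ in their last bits from
# run to run. If two candidate tokens are nearly tied, even at
# temperature 0 the argmax can flip, and every token generated after
# that flip comes from a different prefix.
```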