Can LLMs SAT?
7 points by aiono
The simple answer is of course "no", because SAT is NP-complete and LLM inference does not take exponential time in the length of the prompt. So, unless P = NP, any LLM can only either guess the answer blindly (i.e. with no success probability meaningfully above 0.5, since SAT is not known to be in BPP either) or give the correct answer only on a fixed subset of memorized examples.
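To make the complexity gap concrete, here is a minimal brute-force SAT checker (my own illustrative sketch, not anything from the post, using DIMACS-style signed-integer literals as an assumed encoding). It enumerates all 2^n assignments, which is exactly the exponential work a polynomial-time inference pass cannot do in general:

```python
from itertools import product

def brute_force_sat(clauses, n_vars):
    """Decide satisfiability by trying all 2**n_vars assignments.

    A clause is a list of nonzero ints: v means variable v is true,
    -v means variable v is false (DIMACS-style literals).
    Worst-case time is exponential in n_vars.
    """
    for assignment in product([False, True], repeat=n_vars):
        # A formula is satisfied if every clause has some true literal.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# (x1 or x2) and (not x1 or x2) and (not x2) -- unsatisfiable
print(brute_force_sat([[1, 2], [-1, 2], [-2]], 2))  # False
# (x1 or not x2) -- satisfiable
print(brute_force_sat([[1, -2]], 2))                # True
```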
Funnily enough, this is exactly what happens with the LLMs tested in the post: GPT 5.2 guesses correctly on about half of the inputs, and GPT 5.2 mini only works on instances with fewer than 100 clauses.
I don't think this is true? Inference is roughly quadratic per output token, but the LLM then reruns inference on the input plus the output so far to produce the second output token, then again for the third, and so on. So it can produce (and spend compute on) exponentially more output than the original input.
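The point above can be sketched with a toy cost model (my own assumption, not measurements from any actual model): if each generation step costs roughly (context length)^2 attention work, total compute grows with the number of output tokens, so allowing 2^n chain-of-thought tokens gives compute exponential in n even though each single pass is polynomial:

```python
def autoregressive_cost(prompt_len, output_len):
    """Total attention cost of generating output_len tokens one at a
    time: step t attends over prompt_len + t tokens, costing roughly
    (prompt_len + t)**2 (a toy quadratic cost model)."""
    return sum((prompt_len + t) ** 2 for t in range(1, output_len + 1))

# With a fixed 100-token prompt, letting the model emit 2**n output
# tokens makes total compute grow exponentially in n, unlike a single
# forward pass over the prompt alone.
for n in (4, 8, 12):
    print(n, autoregressive_cost(100, 2 ** n))
```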