How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation
26 points by ohrv
One of the few healthy perspectives on AI; I wish there were more AI tooling for careful experiments like this. Here he sets up a controlled benchmark, comes up with an algorithm to feed context into an LLM, and then manually verifies the results. It would be nice to have a CLI or UI tool to quickly set up such pipelines and efficiently review the results (a rough sketch of the idea follows below).
A nice breath of fresh air amid all the “I let agents loose on my code” posts. I think it’s a good bet that AGI isn’t going to happen on LLM tech and that significant value will be carefully extracted via thoughtful human-in-the-loop pipelines like this.
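In that spirit, here is a minimal sketch of what such a pipeline could look like using the OpenAI Python SDK: read the code under audit, run the same prompt N times, and dump each report to a file for manual review. The model name, prompt, and input filename are placeholders for illustration, not the author's actual setup:

```python
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are auditing the following kernel code for memory-safety bugs.
Report any use-after-free or authentication bypass, or say 'no bug found'.

{code}"""

def audit(code_path: str, runs: int = 100, out_dir: str = "reports") -> None:
    """Run the same audit prompt many times and save each report for human review."""
    code = Path(code_path).read_text()
    Path(out_dir).mkdir(exist_ok=True)
    for i in range(runs):
        resp = client.chat.completions.create(
            model="o3",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT.format(code=code)}],
        )
        # One report per run; a human sifts through these afterwards.
        Path(out_dir, f"run_{i:03}.md").write_text(resp.choices[0].message.content)

if __name__ == "__main__":
    audit("ksmbd_session_setup.c")  # hypothetical file with the handler code under audit
```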
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives.
An 8% success rate, with the failures split roughly 1/3 false positives and 2/3 false negatives.
I see we’re using the word “find” very loosely.
[when providing larger input file] o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it.
again, “find” is doing a lot of lifting here.
I don’t know what I was expecting when reading this… but it wasn’t the clickbait I’ve now categorized this post/author as.
I notice many people discussing this post without being aware that he had the LLM discover a vulnerability he had personally discovered, without an LLM, beforehand.
This is similar to how you can get an LLM to give you the right answer to a difficult question you already know the answer to by tweaking the prompt a few times until it says what you want.
Not quite: he was testing whether the LLM could find a vulnerability he had found previously, but it turned up a different one from the one he found.
Let’s first discuss CVE-2025-37778, a vulnerability that I found manually and which I was using as a benchmark for o3’s capabilities when it found the zeroday, CVE-2025-37899.
I appreciate the correction! I’m a bit fatigued with the pattern of people misrepresenting things LLMs do, so I managed to miss that.
This is fuzzing but with tremendously more power use?
Not really. Instead of generating inputs that trigger bad behaviour (e.g. sanitiser violations, crashes), this generates lots of (mostly bad) reasons/“proofs” why the code is unsafe, and is sometimes right (1–8% of the time).
It’s an impressive result that it’s able to reproduce finding the original CVE, which is very recent (~1st of May), so it’s likely not just regurgitating something from the training set. Then to find another novel one on top of that! However, the real findings are hidden among lots of incorrect results, which undoubtedly requires someone already skilled (like the author) to sift through them and intuit which ones are even right or worth further investigation.
It shouldn’t be outside the realm of possibility for the LLM to work off each “proof” and try to generate a reproducing example, at which point the finding could be verifiably true. That would probably require a lot of attempts (unless the reasoning ability gets a lot better) and, as you say, burn a lot of power.
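A rough sketch of what that verification loop might look like: ask the model for a proof-of-concept input for each finding and only keep findings whose PoC actually crashes an instrumented (e.g. ASan) build. `ask_llm_for_poc` is a hypothetical helper, and the crash-detection heuristic is an assumption, not anything from the article:

```python
import subprocess
import tempfile

def ask_llm_for_poc(finding: str) -> bytes:
    """Hypothetical helper: ask the model to produce a reproducing input for a finding."""
    raise NotImplementedError

def verify_finding(finding: str, target_cmd: list[str], attempts: int = 10) -> bool:
    """Try to turn an LLM 'proof' of a bug into a verified reproducer."""
    for _ in range(attempts):
        poc = ask_llm_for_poc(finding)
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(poc)
            poc_path = f.name
        # Run the sanitiser-instrumented target against the candidate PoC.
        result = subprocess.run(target_cmd + [poc_path], capture_output=True, text=True)
        if "AddressSanitizer" in result.stderr or result.returncode < 0:
            return True  # a crash/sanitiser report makes the finding verifiably true
    return False  # no reproducer found: the "proof" stays unverified
```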
Fuzzers themselves are quite power-hungry anyway, I would imagine, right? Finding bad inputs is really a needle-in-a-haystack game, and the input domains are quite large. Large or small, I think the energy spent on either method is for A Good Cause and shouldn’t dissuade us from using them. Using AI in tandem with fuzzers is already a thing, and just like pruning in chess engines, cutting off likely-non-fruitful paths with an imperfect heuristic still yields impressive results.
Fuzzing has a better reward function: you can mutate/generate inputs based on code coverage (see the sketch below). The LLM is mostly operating in a theoretical “anything goes” space.
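For context, a toy illustration of that coverage-guided reward loop: keep an input in the corpus only if it reaches code no previous input reached. The target and mutator here are deliberately trivial stand-ins for a real instrumented binary (what AFL/libFuzzer do with edge coverage):

```python
import random

def toy_target(data: bytes) -> frozenset:
    """Toy target: returns the set of 'branches' the input reaches."""
    hit = set()
    if len(data) > 0 and data[0] == ord("F"):
        hit.add("b1")
        if len(data) > 1 and data[1] == ord("U"):
            hit.add("b2")
            if len(data) > 2 and data[2] == ord("Z"):
                hit.add("b3")  # pretend this deepest branch contains the bug
    return frozenset(hit)

def mutate(data: bytes) -> bytes:
    """Flip one random byte (a real fuzzer has many more mutation strategies)."""
    buf = bytearray(data or b"\x00")
    buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

corpus = [b"AAAA"]
seen: set = set()
for _ in range(200_000):
    child = mutate(random.choice(corpus))
    cov = toy_target(child)
    if not cov <= seen:          # new coverage is the reward signal
        corpus.append(child)     # keep inputs that reach new code
        seen |= cov
print("branches reached:", sorted(seen))
```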
That was not the author’s take, at least:
Considering the attributes of creativity, flexibility, and generality, LLMs are far more similar to a human code auditor than they are to symbolic execution, abstract interpretation or fuzzing