Vulnerability Reports Are Not Special Anymore

29 points by orib

dmbaturin

I believe premature disclosure still helps attackers far more than it helps anyone else: in my experience and from what I've seen, even with frontier models, the false-positive rate can still be close to 90%.

Tangentially:

"we believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology"

I'm surprised anyone still believes in blockchain technology.

rbr

What do you mean by false positives? Is it that the model claims it found a vulnerability, and when you go look it turns out not to be a problem? I thought at this point it was pretty well established that finding vulnerabilities (and code comprehension in general) was the one thing LLMs are good at. Is that actually true? Can you elaborate on your experience, or maybe point me towards some sources? I'm genuinely curious whether I have to adjust my assumptions. (I don't use LLMs, so it's hard for me to get an idea of how good they are these days.)
- spc476
  
  It could be the model (of course). Last month I had an LLM foisting a false positive on one of my repos. Of course, it apologized, and then apologized again for "fixing" the C code with Python. It was an absolute joke.
- bakkot
  
  LLMs are very good at finding vulnerabilities (among other things), but they are also prone to false positives. Both can be true: a reporter who sends you 100 reports, of which 15 are true vulnerabilities, is very good at finding vulnerabilities (assuming you've put any effort into securing your code), but would still be very annoying to deal with.
  
  The above claim is just based on my own experience, but here's some data from the curl project, discussing how AI-based reports have improved this year vs. last: "The rate of confirmed vulnerabilities is back to and even surpassing the 2024 pre-AI level, meaning somewhere in the 15-16% range." So only about 1 in 6 reports is a real vulnerability, but that's still a lot!
  
  Incidentally, one of the things which made Mythos/Fable more useful here was that it was better at putting together working PoCs, which makes identifying true positives much easier.
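
  To make the arithmetic in the two earlier paragraphs concrete, here is a minimal sketch. It is only an illustration: the function names are hypothetical, and the only inputs are the figures already quoted above (15 true reports out of 100, and curl's 15-16% confirmed rate).

  ```python
  def confirmed_rate(true_positives: int, total_reports: int) -> float:
      """Fraction of submitted reports that turned out to be real vulnerabilities."""
      if total_reports <= 0:
          raise ValueError("total_reports must be positive")
      return true_positives / total_reports


  def reports_per_real_vuln(rate: float) -> float:
      """Average number of reports a maintainer triages per confirmed vulnerability."""
      if not 0 < rate <= 1:
          raise ValueError("rate must be in (0, 1]")
      return 1 / rate


  # The hypothetical reporter from above: 100 reports, 15 of them real.
  print(f"confirmed rate: {confirmed_rate(15, 100):.0%}")

  # The 15-16% range quoted from the curl project.
  for rate in (0.15, 0.16):
      print(f"{rate:.0%} confirmed: ~{reports_per_real_vuln(rate):.2f} reports triaged per real vulnerability")
  ```

  At those rates a maintainer reads a bit more than six reports for every real vulnerability, which is where the "about 1 in 6" figure comes from.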
freddyb

Special vulnerability reports should be treated as special, and it is on the defender to work on better verification and to publish threat models, so that people can meet (and verify) a new, higher bar for what constitutes a great report.
- FiloSottile
  
  Yeah, that might be where things land in the end: vulnerability reports are not special in general, but some high-severity and/or high-trust ones are special vulnerability reports.
Shorden

I agree that finding security issues is now easier, and with the barrier to entry lowered I'm sure there's more noise on security mailing lists than in previous years. That said, I absolutely would still prioritize bug reports from respected consultancies like Trail of Bits or Zellic, not only because I trust them not to submit slop reports, but because I think that top security researchers (with LLMs) beat just running an LLM in CI.
- samuelkarp
  
  "bug reports from respected consultancies like Trail of Bits or Zellic, not only because I trust them not to submit slop reports"
  
  Having (recently) worked with vendors like this: the slop still happens. It just comes alongside higher-quality reports. An LLM can misunderstand threat models and decide to be lazy about how it demonstrates an "exploit" no matter who is steering it, and if the security researcher isn't watching out for these missteps, they make their way to us as maintainers. containerd has received slop reports from vendors like this, from LLM research companies (the same foundational model companies you know about already), and from J. Random User on the internet.
  
  Duplication is another challenge that Filippo didn't quite state here, and it is a significant drain on triage attention. If you look at containerd's most recent advisories, you'll see we've credited 9 separate researchers/groups, including respected groups like the ones you mentioned. To me, that shows (a) LLMs find a lot of the same issues regardless of who is using them, and (b) there's not necessarily much special about reports from known reporters. In contrast, this one was credited to only one reporter because we didn't actually get any duplicates, and this reporter is not one we had any prior knowledge of or relationship with.