LLM Reviews in cargo-crev
14 points by dpc_pw
An LLM can easily and reliably check if a code version published on https://crates.io matches the code published in git.
Do you really need an LLM for that?
I suspect what this is saying is more like:
There are a bunch of ways that things can be slightly different but still correct, and a bunch of semi-deterministic checks you could run in this space. An LLM makes it easier to collapse all of that into something which can be reasonably validated with minimal effort / additional manual work.
Exactly. It's unfortunately very common that people publishing crates have minor mismatches, forgot to tag, etc. An LLM can dive as deep as needed, including running a full diff and analyzing the differences to verify they're benign. And then it can check a bunch of other things while it's at it.
You don't, but it's not yet as straightforward as it should be.
https://github.com/M4SS-Code/cargo-goggles
https://internals.rust-lang.org/t/verifying-that-crate-files-match-the-git-repository/20587/34
https://lawngno.me/blog/2024/06/10/divine-provenance.html
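As a concrete illustration of the "run a full diff and classify the leftovers" idea from above, here's a minimal shell sketch on mock directories. The paths are made up for the demo, and the benign-file filter is illustrative; the underlying fact is that `cargo publish` adds files such as `.cargo_vcs_info.json` and a normalized `Cargo.toml`, so a raw diff between the git checkout and the extracted `.crate` tarball is rarely empty even for an honest release:

```shell
#!/bin/sh
# Mock demo: diff a "git checkout" against an "extracted .crate
# tarball", then classify the leftovers.
set -eu
rm -rf /tmp/crev-demo
mkdir -p /tmp/crev-demo/git /tmp/crev-demo/crate
echo 'fn main() {}' > /tmp/crev-demo/git/main.rs
echo 'fn main() {}' > /tmp/crev-demo/crate/main.rs
# cargo publish adds files like this, so a raw diff is rarely empty:
echo '{"git":{"sha1":"abc123"}}' > /tmp/crev-demo/crate/.cargo_vcs_info.json

# Anything surviving this filter deserves a closer (human or LLM) look.
leftover=$(diff -rq /tmp/crev-demo/git /tmp/crev-demo/crate \
  | grep -v -e 'cargo_vcs_info' -e 'Cargo.toml' || true)
if [ -n "$leftover" ]; then
  printf 'needs review:\n%s\n' "$leftover"
else
  echo 'only publish-time noise; trees match'
fi
```

The point of the filter step is exactly where an LLM helps: deciding whether the leftover differences are benign publish-time noise or something that needs a human.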
Is it wrong that, even after reading this, what comes to mind is cURL ending their bug bounty program due to AI slop and "Thank goodness cargo-crev is a web-of-trust design"?
It really is starting to feel like we're back in the 90s when you had to rely on manually curated community indexes/lists to find good sites, but with the curated indexes lost in the haze.
I am not sure what you mean.
Personally, I believed that WoTs were the only practical way forward for scaling trust back in 2018 when cargo-crev was started, and I am even more convinced of it now with the rise of slop.
The community indexes/lists of the 90s were just an instance of a primitive, centralized WoT.
The first paragraph (i.e. "is it wrong...?") is "Is it wrong that, even after reading this, my intuitive reaction is that this will be a net negative?"
The second paragraph is more a general observation on the state of the Internet in the 2020s. I recently discovered Splitting the Web and, while I certainly feel the thesis is correct, it also feels like the WoT for the indieweb is anchored in social media (e.g. Mastodon), forums (e.g. VOGONS, TinkerDifferent, etc.), and Discord or Matrix "servers", and I've struggled all my life with having only two settings for those forms of communication: attempt to drink from the firehose, or only pay attention to PMs/DMs.
(The reason I was so active on /r/rust/ during Rust's first decade of existence was because I was successfully drinking from the /r/rust/ RSS feed firehose.)
I'm dismayed at how few code reviews there are. There might be more blog posts published about the importance of supply-chain security than actual code reviews.
cargo-crev's UI isn't the best (I've contributed a few things to make it easier), but there's also cargo-vet, which has a lower barrier to entry. Still, approximately nobody seems to be reviewing crates. There are 60,000+ accounts that have published a crate, but only ~100 have kicked the tires on cargo-crev, and in reality it's just a handful of people reviewing a trace amount of crates. cargo-vet also rests on the shoulders of a few people. Meanwhile, crates.io downloads are growing exponentially.
So even though I think LLMs are too gullible and imprecise to review adversarial code, we're screwed either way.
I have a plan. I’ve been working toward it on and off for quite a few years (ugh, https://git.chrismorgan.info/crev-proofs was four years ago, and I was doing private reviews with ideas in this direction eight years ago), but am currently actively working on it and I currently think I’ll get to the relevant stage later this year.
Replace my too-rigid website with one in a new paradigm, encouraging more writing and publishing. (I hope to deploy on Wednesday, though it will be a long way from complete, for my vision is expansive. Early content will focus more on HTML and CSS than Rust. I’ll also be including things of debatable commercial value like a synthesised pipe organ.)
Then, start publishing detailed reviews of code of various kinds. The good things. The bad things. Concerns. Opinions. Alternatives. Detailed explanations of why such-and-such is wrong or inferior, accumulating pages for common problems or opinions to link to. Comparisons between multiple similar libraries. And so on. I've done some such reviews before, especially of CSS frameworks, on HN and now here (e.g. one quick one here a few days ago for dev.css), and they're generally very well received; at a guess, half of my best-voted comments on HN would be that sort of thing.
(Aside/meta: on sites like these, I want a “my top comments” view. I spend a lot of time writing some content, and would like to go back and remould some of it into my own website. On HN, the threads page is paginated, so you can tediously scrape all your comments, and then sort them. Here, I don’t see any practical way of downloading all my comments. My top comment on HN, incidentally, rails against the stupidity that is stale bots on bug trackers, +287.)
My purpose, when doing these sorts of reviews, is not just to say “yes, this library is safe to use”, but to teach. I love teaching. Show people alternatives, better ways of doing things, nifty tricks. Make it so that some normal developers will want to read my code reviews. Build an audience.
Then steadily solicit suggestions or requests. "Can you review this library", that kind of thing. Also branch out into different media a bit: video reviews and live-streaming are interesting in their own right. (My website will also be integrating handwriting over time, a medium I think is overlooked.)
Finally, the thing to make it sustainable: money. The audience has already been built along the way. Now solicit payment for reviews (to cover things I wouldn't otherwise, or to climb the queue): your code, or your dependencies, advice on which library to choose for X, &c. And aim to have reviews public, though private review at commercial rates will also be available. Try to get subscriptions, a retainer sort of arrangement.
I’ll see how it all goes. And if anyone else wants to do something similar or take inspiration from this, go ahead.
Then, start publishing detailed reviews of code of various kinds. The good things. The bad things. Concerns. Opinions. Alternatives.
There might be something there. Maybe it's possible to make chatting about code under review a socially engaging experience, similar to the way talking about news and arguing in the comments is.
I'm dismayed at how few code reviews there are.
Well, it is time-consuming unpaid work, in a world where devs are already struggling just to maintain what they have written, with the whole world largely feeling entitled to free software, and often being rude about it. I can't really blame individual open-source developers, maintainers, contributors, and users.
I still think that commercial entities, being primarily interested in not getting p0wned in a freak supply-chain accident, should step up. It's often infeasible for them to sponsor all the little FOSS work they greatly benefit from, but they could at least do something about ensuring their own products are secure, and share that with the wider community.
Hopefully LLM reviews in cargo-crev (and possibly other similar tools) would allow them to easily turn their LLM budgets into supply chain security.
So even though I think LLMs are too gullible and imprecise to review adversarial code, we're screwed either way.
I think in principle it is possible to have an LLM review adversarial code. But it will require special preparation (prompts, review loops, techniques), and this is currently out of scope for what I'm hoping to achieve.
I am a strong believer in the 80/20 rule: finding >80% of exploits can be done with 20% of the effort. Key stealers, crypto miners, etc. stand out and should be relatively easy to identify, especially in Rust. IMO, consumer-grade LLMs in an unsophisticated setup should be able to review 100% of a small project's dependencies, and with some WoT wins, the work can be spread out and shared.
Or at least it's worth a try.
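To make the 80/20 point concrete, even a crude non-LLM prefilter catches the loudest cases. A toy sketch on mock source trees (the directory names and the pattern list are invented for the demo; a real reviewer, LLM or human, would go much deeper):

```shell
#!/bin/sh
# Mock demo of a crude prefilter: flag source files that mention things
# a pure-computation dependency has no business touching.
set -eu
rm -rf /tmp/prefilter
mkdir -p /tmp/prefilter/innocent /tmp/prefilter/shady
echo 'pub fn add(a: u32, b: u32) -> u32 { a + b }' > /tmp/prefilter/innocent/lib.rs
echo 'let key = std::fs::read("/home/user/.ssh/id_rsa");' > /tmp/prefilter/shady/lib.rs

# -r: recurse, -l: print only the names of matching files.
grep -rl -e '\.ssh' -e 'Command::new' -e 'TcpStream::connect' /tmp/prefilter \
  || echo 'nothing flagged'
```

A grep list like this is trivially evaded, which is exactly why the remaining "20% effort" (actual review) still matters; but it illustrates how far a cheap first pass can get.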
it is time-consuming unpaid work, in a world where devs are already struggling
Not only is it time-consuming and unpaid, it's also work that can feel unimportant. Code review over an entire software ecosystem would be really valuable in aggregate, but if I were to pick one crate out of my dependencies and review it, there's a >99% chance the crate would be completely fine, and I would have "wasted" an hour. Now logically, I know it's not a complete "waste"! But that sort of thing can really take a bite out of your motivation.
I wonder if cargo will one day have a “virus scanner” mode where all downloaded crates get run through an LLM tool of your choice before it’s allowed to touch disk.
Not difficult to implement now.
#!/bin/sh
cargo fetch
# TODO: Scan command goes here
exec cargo --offline "$@"
(A little bit more complexity required if you want to notice --locked, --target, and --manifest-path and propagate them to the cargo fetch line.)
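Filling in the TODO a little, here's a hedged sketch. `scan_tool` is a hypothetical placeholder (here it merely rejects crates shipping a `build.rs`, since build scripts run arbitrary code at compile time); the cache path under `registry/src` is cargo's current layout, an implementation detail rather than a stable guarantee, so the real wiring is left commented out and the logic is demonstrated on a mock cache:

```shell
#!/bin/sh
# Sketch of a scan-before-build gate for fetched crates.
set -eu

# Placeholder scanner: reject any crate shipping a build script.
# Swap in an LLM-backed reviewer here.
scan_tool() {
  ! [ -f "$1/build.rs" ]
}

# Scan every crate directory under a cache directory.
scan_all() {
  for crate in "$1"/*/; do
    scan_tool "$crate" || { echo "flagged: $crate"; return 1; }
  done
}

# The real wiring would be roughly (cache layout is not a stable API):
#   cargo fetch
#   for cache in "${CARGO_HOME:-$HOME/.cargo}"/registry/src/*/; do
#     scan_all "$cache" || exit 1
#   done
#   exec cargo --offline "$@"

# Demo on a mock cache instead:
rm -rf /tmp/mock-cache
mkdir -p /tmp/mock-cache/ok-crate /tmp/mock-cache/evil-crate
touch /tmp/mock-cache/evil-crate/build.rs
scan_all /tmp/mock-cache || echo 'build blocked'
```

The key property is ordering: the scan runs between `cargo fetch` and any build step, so a flagged crate is rejected before anything in it can execute.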
Interesting idea. I like it.
Though even the rudimentary review I'm advertising as built into cargo-crev takes a couple of minutes per crate, even a smallish one. It seems to me some WoT sharing is required to cut the redundancy. Everyone doing an LLM review all the time, for every crate, is not going to scale, even if LLMs got 10x faster.