LLVM: The bad parts
69 points by asb
This pretty much matches my model of the problems with the project. The PR vs reviewers mismatch has always been fairly tough, but now there's an increasingly large number of slop generated PRs being filed, which have gargantuan PR descriptions for trivial changes (because, again, slop generation), are grossly over commented (because slop), and then if you do ask for any changes the "author" cannot actually make them, because they did not write, and do not understand, the PR they are presenting as their own.
Now if we see those PRs - generally the oversized PR description at least is such a clear indicator of slop that this isn't a problem - should we even bother reviewing them? Maybe they do fix something, but if we see any issues during the review then we know they're not going to be fixed and the author will simply go off and make more slop PRs. The first one I came across made me actually look up the account, and while it was years old (i.e. not a new spam account), it had a fairly average GH user level of commits over time, and then in the last month had like 300 PRs filed against 10 or 20 projects, none of which they had ever contributed to before.
So now we have PRs that may well be attempting to fix a real bug, but we know:
The code is produced by a text predictor, and those predictors are known for exactly three things: producing incorrect code, confidently presenting it as correct, and being actively designed to appear reasonable no matter how incorrect it might be. That last point makes them hard to review, as they're essentially following the same model as attackers trying to intentionally insert vulnerabilities. In essence slop PRs must be reviewed on the assumption of malice.
The PR author is not going to respond to, or even necessarily be able to respond to, review feedback - except maybe by re-running the slop generator.
In a similar vein, as with other OSS projects there's a constant trickle of made up security vulnerabilities from these garbage tools.
It's still amazing to me that people think these slop generators are good/trustworthy given just how well documented it is that they are neither.
At the much more trivial end of the spectrum the style rules have issues.
First off PascalCase for local variables is horrifying (while this is true, it's not actually an actual problem :D). More seriously, there have been a number of naming rule changes over the years, but because existing code was not simply automatically updated (that kind of thing always makes blame and similar more annoying) there's a near random mix of new and old style code, and people tend to write code that matches the surrounding naming conventions rather than the new one, so there's no forward progress on unifying it (and even reviewers have difficulty tracking what the capitalization rules are meant to be). This lack of consistency makes it confusing for new contributors, as even the style bots don't enforce capitalization rules: they write new code where the documented style does not match the code they're working in, and if they try to work out what the style should be by looking at recent prior commits, even those are not consistent (even if the PRs had review comments about other style issues).
Similarly there's a hard 80 column cap, but the type names are long, and reasonable variable names are not trivially short, and so even trivial expressions end up having to be split across multiple lines, making the code harder to read. To try and deal with this problem, variables are frequently not given remotely useful names; for example, the most common naming scheme for local variables is the acronym of the variable's type name, e.g.
TypeSourceInfo *TSI = ...;
RecordDecl *RD = ...;
...
auto *CXXRD = dyn_cast<CXXRecordDecl>(RD);
Sometimes that's fine due to small methods, but it's also used in large/complex functions simply because, while the name is useless, the code is more readable due to the reduction in line breaks. The argument people often make is to break "large" expressions into lots of separate locals, but the combination of method names that are good, and the expectation that those separate locals should also be named well, means that even trivial expressions end up requiring line breaks, e.g. things like https://github.com/llvm/llvm-project/blob/c726fff61cb3bd5a3a98c52bb63b99fcb3b25524/clang/lib/Sema/SemaExprCXX.cpp#L1736
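To make the tradeoff concrete, here's an illustration written both ways (the functions and names below are made up for the example, not taken from the tree):
#include "clang/AST/DeclCXX.h"
#include "llvm/Support/Casting.h"

using namespace clang;

// Acronym style: each statement fits comfortably within 80 columns.
static bool isPolymorphicRecord(const RecordDecl *RD) {
  const auto *CXXRD = llvm::dyn_cast<CXXRecordDecl>(RD);
  return CXXRD && CXXRD->isPolymorphic();
}

// Descriptive style: the same cast no longer fits in 80 columns, so
// clang-format wraps it and a two-line body becomes three.
static bool isPolymorphicRecordVerbose(const RecordDecl *CandidateRecordDecl) {
  const auto *CandidateCXXRecordDecl =
      llvm::dyn_cast<CXXRecordDecl>(CandidateRecordDecl);
  return CandidateCXXRecordDecl && CandidateCXXRecordDecl->isPolymorphic();
}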
It is of course important to note that some of these issues are the result of C++ syntax, and many other cases of gratuitous line breaks in the existing code are simply due to the failure to use locals to hold complex subexpressions, and of course sometimes the reason that a local is not used is to scope lifetimes etc. - e.g. lifetime extension for temporaries only works for, well, temporaries - so storing something that previously depended on lifetime extension in a local variable might become a UaF.
Now if we see those PRs - generally the oversized PR description at least is such a clear indicator of slop that this isn't a problem - should we even bother reviewing them?
Nod. While LLM agents have become good at generating changes, they remain significantly less capable at reviewing them. This creates a bottleneck where AI increases the volume of contributions without providing the corresponding review capacity. While they are quite good at local changes, they lack a good global understanding (context window limitation).
Related: there have been a lot of LLM agent discussions
I was one of the top reviewers for the past several years. I think we have a reviewer burnout problem. It takes years to train an LLVM maintainer, yet most of their capacity is consumed by drive-by contributions that require extensive guidance. There is also a misalignment of incentives: companies fund developers to write code and integrate it into their internal products (immediate benefit), but would not fund code reviews (while there are still second-order benefits like community building and long-term project health, the value is not meaningful to many leaders).
As an example, in 2018, I was the primary lld migration contributor for Google production, followed by cleanup/improvement work in 2019. Since then, lld has no longer been my assigned work. Maintaining lld has been 'community contribution' in my performance reviews, providing no leverage for achieving a high rating.
First off PascalCase for local variables is horrifying (while this is true, it's not actually an actual problem :D).
In 2019 there was a big variable renaming plan discussion, which was summarized here https://llvm.org/docs/Proposals/VariableNames.html
I wish that llvm/ and clang/ could switch to camelBack, and newer subprojects like bolt/ should also use camelBack, but it will probably never happen due to churn concerns.
More seriously, there have been a number of naming rule changes over the years, but because existing code was not simply automatically updated (that kind of thing always makes blame and similar more annoying) there's a near random mix of new and old style code, and people tend to write code that matches the surrounding naming conventions rather than the new one, so there's no forward progress on unifying it (and even reviewers have difficulty tracking what the capitalization rules are meant to be). [...]
I am not aware of a big naming rule change, but there is indeed major inconsistency. For example,
clang code uses FunctionName (while the llvm lld lldb mlir convention is functionName) and newer code tends to use functionName.
Similarly there's a hard 80 column cap, but the type names are long, and reasonable variable names are not trivially short, and so even trivial expressions end up having to be split across multiple lines, making the code harder to read.
In 2021 dvyukov tried to increase 80 to 100 for sanitizers (in compiler-rt): https://reviews.llvm.org/D106436 . However, it is difficult to get consensus for such a large (sub)project, and the change got reverted.
Nod. While LLM agents have become good at generating changes, they remain significantly less capable at reviewing them.
No, these text predictors have become better at generating code that looks plausibly correct. They present nonsense code with the same confidence as code that happens to be correct. They cannot review code for the same reason using them as code generators is bad: the entire design is around producing plausible looking code. The difference is that recognizing that a review is nonsense is significantly easier than doing so for any moderately significant amount of code.
I would actually not be as opposed to negative reviews from these tools as you can see them as a lightweight (in terms of time, not horrific environmental damage) probabilistic version of static analysis - maybe it could be considered Monte Carlo static analysis?
This creates a bottleneck where AI increases the volume of contributions without providing the corresponding review capacity.
The problem is that in addition to producing boatloads of bad PRs, the people spamming those changes cannot actually respond to review feedback, because they’re using automated cut-paste-from-stackoverflow to “write” their PRs. The text predictor can’t do anything either, because it’s designed for the sole purpose of making plausible looking results.
While they are quite good at local changes, they lack a good global understanding (context window limitation).
No, they have no understanding. The context window is not relevant: they are text predictors. A larger window just means that they are able to maintain the appearance of plausibility in ever larger patches. In other words, making the window larger does nothing except increase the load on reviewers, who are being forced to review larger and larger PRs written by what could charitably be labeled a first year CS student who is really good at copying and pasting code and then editing it to make it look like the surrounding code, irrespective of correctness.
The real killer of course is that the entire model for attacking OSS code is to position yourself as a trusted contributor (in this case an inexplicably trusted text predictor), and then generate plausible looking code with an intentional vulnerability. It is already a lot of work to review even relatively small PRs from new contributors; adding an unending stream of slop PRs that need to be treated as if they were created by someone competent, when they are designed to do exactly what attackers want to do, is unsustainable.
The fundamental problem is that vibe coders are unwilling to acknowledge the basic facts of how these text predictors work, continue to talk about them as if they have some kind of actual understanding, and then say that everyone else must deal with the garbage they produce. That last part is critical because until that happens they have to contend with the fact that needing these tools to do their job is a good signal for how much you can trust the code “they” write, how much they actually care about that code, and for the overwhelming majority their level of basic competence.
The fact that they can spam out enough garbage, claimed as their own work, to swamp all the work of people who are actually competent is an advantage to those people. They know that eventually exhaustion will start forcing reviewers to accept that code.
We know this is what is happening due to the near absolute resistance of such people to the concept of saying up front that they did not author the PR. It is critical to this fiction that the people who are spamming this slop do not have to acknowledge that they are not actually the author.
I think this is a great summary. However, I would add three more points:
[...] load [23 x {i8, i16}]), weird vector sizes (<23 x i34>), and overly large integers (i1234). We already discourage using these types, but unfortunately, there are users and they must be handled by transformations and all back-ends.
It's actually a relief to see this post from the lead maintainer of LLVM, since it's really on point with the issues that exist. That gives me hope that there's a way forward, which would only improve LLVM's success further.
One topic which wasn't covered was user documentation for the unstable C++ APIs, like Clang's. For jank (a native Clojure dialect on LLVM), I am JIT compiling C++ and then using Clang's AST to provide seamless C++ interop from Clojure. However, none of this stuff is documented and none of it actually feels supported. There are times when the functionality I want is buried within a Clang .cpp file with internal linkage. There are times when I have a couple of QualType objects or Decl objects and I want to answer a question like "Can I C-style cast between these types?" and there is nothing other than digging through Clang's code, searching GitHub, and asking LLMs which is going to help me with that. When you factor in larger tasks like overload resolution with ADL support, or gracefully handling JIT template instantiation failures, Clang is both the enabler of jank's awesome tech and the largest impediment to its progress.
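To give a flavor of the kind of probing that ends up being necessary, here's a sketch of how you can answer the C-style cast question by asking Sema directly. The canCStyleCast wrapper is hypothetical, and BuildCStyleCastExpr / getTrivialTypeSourceInfo are internal, unstable entry points whose exact signatures can shift between Clang versions:
#include "clang/AST/ASTContext.h"
#include "clang/Sema/Sema.h"

using namespace clang;

// "Can I C-style cast an expression of type From to type To?" answered by
// asking Sema to build the cast and checking whether it was rejected. A real
// version would also capture or suppress the diagnostics Sema emits along the
// way - exactly the undocumented territory described above.
static bool canCStyleCast(Sema &S, Expr *FromExpr, QualType To) {
  ASTContext &Ctx = S.getASTContext();
  TypeSourceInfo *ToTSI = Ctx.getTrivialTypeSourceInfo(To);
  ExprResult Cast = S.BuildCStyleCastExpr(SourceLocation(), ToTSI,
                                          SourceLocation(), FromExpr);
  return !Cast.isInvalid();
}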
If the stable C APIs for Clang captured what I needed, I'd use them in a heartbeat, since that would also cut down my compile-times significantly, but they don't.
If some centralized Clang team provided a paid support model, that could also solve most of my problems. I would happily throw money at this, if it would help unblock me, but every Clang dev I know is too busy to review a PR, let alone look into actual bugs or provide guidance in developing solutions. That's not to say they're not helpful, great devs, and great people. It's just to say that there's a huge capacity problem which gatekeeps Clang/LLVM usage to those who can figure everything out themselves or who know the Right People.
I resonate with your frustration. It's a steep mountain to climb; the C++ syntax tree is incredibly complex, and the scale of Clang's internal concepts is massive. Most powerful parts of Clang require you to be a detective as much as a developer.
Back in 2016/2017, I was looking for C++ cross-reference tools. At the time, projects like rtags and cquery relied on libclang for parsing and indexing, but I realized that the C API was extremely lacking in features for anything sophisticated. To understand how a feature actually worked, I had to attach a debugger and trace libclang execution line-by-line.
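To give a sense of the level it operates at, a typical libclang traversal looks roughly like this (an illustrative sketch; the C entry points shown are the stable libclang API, but the program itself is made up):
#include <clang-c/Index.h>
#include <cstdio>

// Print the kind and spelling of every cursor in a translation unit. This is
// about the granularity the C API exposes; finer-grained semantic questions
// (templates, overload resolution, etc.) have no C entry point.
static CXChildVisitResult visit(CXCursor Cursor, CXCursor, CXClientData) {
  CXString Spelling = clang_getCursorSpelling(Cursor);
  CXString Kind = clang_getCursorKindSpelling(clang_getCursorKind(Cursor));
  std::printf("%s: %s\n", clang_getCString(Kind), clang_getCString(Spelling));
  clang_disposeString(Spelling);
  clang_disposeString(Kind);
  return CXChildVisit_Recurse;
}

int main(int argc, char **argv) {
  if (argc < 2)
    return 1;
  CXIndex Index = clang_createIndex(/*excludeDeclarationsFromPCH=*/0,
                                    /*displayDiagnostics=*/1);
  CXTranslationUnit TU = clang_parseTranslationUnit(
      Index, argv[1], /*command_line_args=*/nullptr, 0,
      /*unsaved_files=*/nullptr, 0, CXTranslationUnit_None);
  if (TU) {
    clang_visitChildren(clang_getTranslationUnitCursor(TU), visit, nullptr);
    clang_disposeTranslationUnit(TU);
  }
  clang_disposeIndex(Index);
  return 0;
}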
In 2018, I had to migrate to the C++ APIs (primarily clangDriver, clangFrontend, and clangIndex) for ccls. The 'documentation' was essentially just studying code fragments within Clang itself, hunting for related utilities in clang/tools/, and using GDB to trace execution extensively.
I agree that external projects like jank are in a difficult spot. Realistically I think better modularization and API design only happen when there are in-tree use cases to justify them. It is notoriously difficult to justify architectural changes or new interfaces without a project inside the LLVM monorepo that actively exercises them. The maintenance reality is that if contributor A exposes an internal linkage symbol today to help an external project, contributor B might come along a year later, notice the symbol isn't being used by any other in-tree Clang library, and internalize it again to clean up the API. While adding a C++ unit test can prevent that specific regression, it’s hard for future maintainers to know if the motivation behind that test is still valid years later if there is no 'real' in-tree consumer to point to.
Most of the CI and build time problems can be solved with the Bazel overlay in LLVM. In fact, I know several projects that depend directly on LLVM and intentionally leverage Bazel so that they can upgrade to the latest commit/cherry-pick fixes with high confidence.
If you combine that with remote build execution, which deduplicates build actions sent to a remote build farm, 150 commits a day should cost a very minimal amount of compute to churn through.
I feel parts of this just happen once a project gets large, especially the compile times and the migrations stuff
For years now, I’ve heard that a “freeze” operation for undef would be just around the corner, but we’re still in the unfortunate state that Rust over-claims what kind of uninitialized memory is UB to materialize a reference to. In practice, materializing a mutable slice of integer types is fine if you write to every slot of the slice before reading. Avoiding this documented-but-not-really UB on the Rust side puts MaybeUninit all over, which makes APIs bad.
I really wish we got an LLVM freeze already and appropriate exposure of the operation to Rust, so that you could say that freshly-allocated memory could be treated as carrying some stable integer values that do not count as uninitialized.
I might not understand the freeze semantics you are looking for, but isn't the freeze instruction already in LLVM for undef and poison? I thought it has been there since 2019: https://llvm.org/docs/UndefinedBehavior.html
Maybe it hasn't been implemented for operations that Rust wants to use it for?
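For reference, a minimal sketch of emitting it through the C++ API (assuming a reasonably recent LLVM; IRBuilder::CreateFreeze is the real entry point, while the surrounding function is just made-up scaffolding):
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Builds a function that loads an i32 which may be undef/poison, freezes it,
// and returns the frozen value.
llvm::Function *buildFrozenLoad(llvm::Module &M) {
  llvm::LLVMContext &Ctx = M.getContext();
  llvm::IRBuilder<> B(Ctx);

  auto *FnTy = llvm::FunctionType::get(B.getInt32Ty(), {B.getPtrTy()},
                                       /*isVarArg=*/false);
  auto *Fn = llvm::Function::Create(FnTy, llvm::Function::ExternalLinkage,
                                    "frozen_load", M);
  B.SetInsertPoint(llvm::BasicBlock::Create(Ctx, "entry", Fn));

  // The loaded bits may be undef/poison as far as LLVM is concerned...
  llvm::Value *Raw = B.CreateLoad(B.getInt32Ty(), Fn->getArg(0), "raw");
  // ...but freeze pins them to some arbitrary-but-stable i32, so every later
  // use observes the same value.
  llvm::Value *Stable = B.CreateFreeze(Raw, "stable");
  B.CreateRet(Stable);
  return Fn;
}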
Rust won't change this, regardless of LLVM. Reading garbage data is the Heartbleed vulnerability, and &mut is guaranteed to be safe to read.
Rust might add some write-only reference type someday, but probably not for slices, since statically tracking initialized state there is harder.
For the foreseeable future, MaybeUninit, Write or other buffer abstractions are the way to go.
A potential improvement here would be a Rust-style PR assignment system.
This would be a great step in the right direction. LLVM has been pretty slow to take on this kind of infrastructure but is steadily making progress, so I'm optimistic.
However, if you do not upstream your code, then it will also not factor into upstream decision-making.
But I wonder if this is a factor in the SFrame kerfuffle. Is there reluctance to let it stew in a downstream fork?
The introduction of pre-merge testing on PRs did significantly improve the overall CI situation,
Agreed: this is what I'm talking about in regards to being slow to adopt these things. But great that it's here now.
Is there reluctance to let it stew in a downstream fork?
Absolutely. LLVM has huge commit traffic and you can expect most things that are not upstream to break after a few weeks, both in patch applicability and functionality. Carrying large amounts of downstream patches therefore needs very frequent (~daily) rebasing to be maintainable and remains a huge effort — to the point where the effort of upstreaming is lower than the effort of having a local patch.
Incidentally, I think this is also part of the reason of LLVM's success: there's a huge incentive to contribute changes back upstream, even though the license doesn't require it. The downside is of course the huge set of half-baked, semi-broken, or not-really-ready code that now lives in the monorepo, must be maintained, and cannot be removed due to active users.
It’s also a reason LLVM often fails to benefit from research projects that are built on LLVM. You start working on something in a branch, finally get it working, do a load of evaluation, fix issues, and end up with something that you might want to talk to upstream about. But now two years have passed and there’s so much code churn that you can’t figure out how to apply your diffs. So you publish your paper and then let the code rot.
Academic research usually doesn't get upstreamed because there is a wide gulf between implementing a prototype in a fork to get some results for a paper, and an actual, production-quality implementation. Academia is usually not interested in working on the latter, and not incentivized to do so (which is not necessarily bad: There's no reason why the same people need to work on research and productization).
Having been on both sides of this, I disagree on several points:
there is a wide gulf between implementing a prototype in a fork to get some results for a paper, and an actual, production-quality implementation
This is true, but it's not as big a gulf as you might think. It's often the case of missing handling for corner cases (especially error handling, where a research prototype can just assert things for missing features and a production one would gracefully fall back to not breaking anything).
Academia is usually not interested in working on the latter, and not incentivized to do so
The latter is completely untrue. Academic departments and individual researchers are judged on impact. Having the result of your research in a production compiler used by big companies is a great way of demonstrating impact.
The problem is often a skills gap: a lot of the code is written by PhD students who aren't experienced in writing code for production environments. Often they are very strongly incentivised to acquire those skills (because they're useful when they try to get a job) but need mentorship / code review. But that means they have to be able to get a PR into a state where someone can give that feedback, or they need people to work with them before that point. But if it takes two months to get a patch into a state where it even applies and works, that's not going to happen.
I only agree in part, I think the core reason is the research community culture. While rebasing of course adds some cost, most research code I've seen is so bad and incomplete (not to mention bogus due to paper-driven development) that still a substantial amount of time would be needed to get it in an upstreamable shape. Researchers could prioritize upstreaming and upstream-first development, but don't because (a) it's more effort, (b) the effort is not appreciated (typically neither from reviewers nor from funding agencies nor in hiring decisions), and (c) due to the fear that someone else could steal their lunch (i.e. gets a paper published earlier due to parts of the work being already done and publicly available).
Most research project code I have encountered is very much “research project code”, but I don’t think it’s reasonable for even a research project to be “years behind main”. You haven’t got much code; you should just be merging or rebasing daily.
I’ve worked on reasonably large side projects on clang and while there’s some work involved tracking main has never been a problem.
But again to echo Nikic and aenglke the quality and completeness of most research project code I have seen has not been at the point where landing in main is reasonable.