An (almost) catastrophic OpenZFS bug and the humans that made it (and Rust is here too)

99 points by bal-e

invlpg

I’m a self-proclaimed Zig fan, but this seems a little disingenuous to me. @andrewrk uses const prolifically where the original code mutated variables in place.

If he had written

    var cols = vdrz.vd_original_width;
    // ...
    cols = vdrz.get_logical_width(txg);

and

    var psize = (asize >> ashift);
    psize -= nparity * (divCeil(asize_shifted, cols) catch unreachable);
    psize <<= ashift;

which is a more direct translation of the original code, then the Zig compiler wouldn’t have caught either mistake. Surely a fairer way to put it is that @andrewrk’s preferred style, and perhaps the style of idiomatic Zig caught the error, not the language itself.

If the original code had been written in the same way, using const everywhere, then clang/gcc would’ve caught the same error with -Wunused-variable.

LenFalken

I was just going to say… This blog post wouldn’t have even been made had it failed. Not a good signal.
- mtlynch
  
  I’m confused by this criticism.
  
  If you develop a tool X that does everything tool Y does and has extra features, why wouldn’t you share results when you find a new thing that tool X does better? There’s no reason for anyone to publish thousands of articles about the ways that Zig and C have similar behavior.
  - masklinn
    
    There’s no reason for anyone to publish thousands of articles about the ways that Zig and C have similar behavior.
    
    If C shows behaviour which is considered defective and zig is found to have the same defect, it absolutely makes sense to publish about it as an opportunity for reflection and finding ways to mitigate or fix the defect.
    
    It’s definitely a lot more interesting than changing the problem in a way C compilers could also solve it then take a victory lap.
    
    mtlynch
    
    It’s definitely a lot more interesting than changing the problem in a way C compilers could also solve it then take a victory lap.
    
    I get that criticism.
    
    But @LenFalken seems to be saying that we should be suspicious of Zig because Andrew wouldn’t have published this blog post had he found the problem to be identical in Zig, which is a different critique that I don’t understand.
  - andrewrk
    
    I always write Zig code to minimize unnecessary use of variables. Why would I make an exception for this snippet?
    
    invlpg
    
    I don’t disagree and don’t think that’s a bad style. In fact most of my Zig code ends up looking the same. But the article opens with questioning whether Zig would have caught the bug. I’d argue that Zig didn’t catch it, but your adherence to that particular style of writing code did it.
    
    I think the article is a solid argument in favor of maximizing the use of const, but doesn’t really make a case for Zig by itself.
    
    andrewrk
    
    Well, I hadn’t considered about unused assignments to variables until this lobste.rs thread, so I’m glad I made the post :-) Will be looking to add that to the language now.
    
    invlpg
    
    Nice! I wonder if that could eventually be the precursor to some form of linear typing too. I’ve read that Hare is looking at exploring that space.
    
    pervognsen
    
    I think it’s pretty common for compilers to do the proper liveness analysis for unused variable warnings. Here’s an example where the liveness analysis has to compute a least fixed point to determine that x is dead:
    
    void f(int n) { int x = 0; for (int i = 0; i < n; i++) x++; }
    
    And here’s clang -Wall:
    
    <source>:2:9: warning: variable 'x' set but not used [-Wunused-but-set-variable] 2 | int x = 0; | ^
    
    If you refcount uses to calculate liveness (which is probably most people’s first idea) you end up computing the greatest fixed point instead, which is less precise when the dead variable’s dataflow graph has cycles. There was a fun GC paper a while back pointing out that RC and GC compute greatest and least fixed points respectively, and RC famously cannot collect cycles. The situation with static analysis of liveness is analogous. Paul Khuong has an old blog post with the observation that most people’s first instinct (“most people” meaning people who don’t work on compilers or static analysis) when solving a problem often leads to the suboptimal fixed point: https://pvk.ca/posts/62/
    
    pervognsen
    
    After more experimentation it appears that clang -Wall does not report a dead variable if it’s used as an rvalue expression:
    
    void f(int n){ int x = 0; x = x + 1; }
    
    That’s not what I’d expect as a programmer! It’s silly that x++ or x += 1 (x is only used as an lvalue) vs x = x + 1 (x is used as an rvalue) makes any difference to the warnings when there’s no relevant difference in semantics or programmer intent. Someone on Mastodon tested it with clang-analyzer and it didn’t catch this either.
    
    robn
    
    It’s silly that x++ or x += 1 (x is only used as an lvalue) vs x = x + 1 (x is used as an rvalue) makes any difference to the warnings when there’s no relevant difference in semantics or programmer intent.
    
    They actually have subtly different semantics.
    
    x = x + 1 is two separate ops: a binary add, and a “simple” assignment. x + 1 has to be evaluated first, and only then is x = x' is resolved. In particular, that can mean two separate accesses to wherever x is stored.
    
    Meanwhile, x += 1 is a “compound” assignment. It’s required to have the same result as x = x + 1, but it is a single op, and x will only be evaluated once.
    
    x++ is not an assignment at all; ++ is an operator that returns x + 1, but the change in the stored value is a side-effect which doesn’t have to be resolved immediately.
    
    The difference for our purposes is that x++ and x += 1 are “indivisible”: the compiler can say with confidence that if you never observe the result, then it wouldn’t have mattered if it didn’t do it.
    
    For x = x + 1 though, the issue is that x + 1 may have side effects that affect the execution of the program. Less for a simple integer on the stack, but it’s really dependent on the type and location in memory of x. x might be a deref for example (s->f), or its type might be volatile.
    
    So to be able to say it’s definitely not used, the compiler has to understand more about the nature of x, which means a more complex analysis.
    
    I don’t really know how hard it would be to produce something useful, but possibly if you tried to narrow it to just cases we can be sure about, there wouldn’t be many opportunities left! I’m guessing that compiler authors probably know what they’re doing, and having to opt-in to a different tool that is slower and less reliable but more expansive is a better tradeoff.
    
    robn
    
    Incidentally, I’ve had a lovely afternoon playing with Clang’s frontend:
    
    void f(int n) { int x = 0; x++; } $ clang -Xclang -ast-dump -fsyntax-only -Wall f.c f.c:2:9: warning: variable 'x' set but not used [-Wunused-but-set-variable] 2 | int x = 0; | ^ TranslationUnitDecl 0x55b3cfa1ca58 <<invalid sloc>> <invalid sloc> `-FunctionDecl 0x55b3cfa83cd0 <f.c:1:1, line:4:1> line:1:6 f 'void (int)' |-ParmVarDecl 0x55b3cfa83c00 <col:8, col:12> col:12 n 'int' `-CompoundStmt 0x55b3cfa83eb8 <col:15, line:4:1> |-DeclStmt 0x55b3cfa83e68 <line:2:5, col:14> | `-VarDecl 0x55b3cfa83de0 <col:5, col:13> col:9 used x 'int' cinit | `-IntegerLiteral 0x55b3cfa83e48 <col:13> 'int' 0 `-UnaryOperator 0x55b3cfa83ea0 <line:3:5, col:6> 'int' postfix '++' `-DeclRefExpr 0x55b3cfa83e80 <col:5> 'int' lvalue Var 0x55b3cfa83de0 'x' 'int' 1 warning generated.
    
    void f(int n) { int x = 0; x += 1; } $ clang -Xclang -ast-dump -fsyntax-only -Wall f.c f.c:2:9: warning: variable 'x' set but not used [-Wunused-but-set-variable] 2 | int x = 0; | ^ TranslationUnitDecl 0x5641b13ada58 <<invalid sloc>> <invalid sloc> `-FunctionDecl 0x5641b1414cd0 <f.c:1:1, line:4:1> line:1:6 f 'void (int)' |-ParmVarDecl 0x5641b1414c00 <col:8, col:12> col:12 n 'int' `-CompoundStmt 0x5641b1414ef0 <col:15, line:4:1> |-DeclStmt 0x5641b1414e68 <line:2:5, col:14> | `-VarDecl 0x5641b1414de0 <col:5, col:13> col:9 used x 'int' cinit | `-IntegerLiteral 0x5641b1414e48 <col:13> 'int' 0 `-CompoundAssignOperator 0x5641b1414ec0 <line:3:5, col:10> 'int' '+=' ComputeLHSTy='int' ComputeResultTy='int' |-DeclRefExpr 0x5641b1414e80 <col:5> 'int' lvalue Var 0x5641b1414de0 'x' 'int' `-IntegerLiteral 0x5641b1414ea0 <col:10> 'int' 1 1 warning generated.
    
    void f(int n) { int x = 0; x = x + 1; } $ clang -Xclang -ast-dump -fsyntax-only -Wall f.c TranslationUnitDecl 0x55ca390e1a58 <<invalid sloc>> <invalid sloc> `-FunctionDecl 0x55ca39148cd0 <f.c:1:1, line:4:1> line:1:6 f 'void (int)' |-ParmVarDecl 0x55ca39148c00 <col:8, col:12> col:12 n 'int' `-CompoundStmt 0x55ca39148f38 <col:15, line:4:1> |-DeclStmt 0x55ca39148e68 <line:2:5, col:14> | `-VarDecl 0x55ca39148de0 <col:5, col:13> col:9 used x 'int' cinit | `-IntegerLiteral 0x55ca39148e48 <col:13> 'int' 0 `-BinaryOperator 0x55ca39148f18 <line:3:5, col:13> 'int' '=' |-DeclRefExpr 0x55ca39148e80 <col:5> 'int' lvalue Var 0x55ca39148de0 'x' 'int' `-BinaryOperator 0x55ca39148ef8 <col:9, col:13> 'int' '+' |-ImplicitCastExpr 0x55ca39148ee0 <col:9> 'int' <LValueToRValue> | `-DeclRefExpr 0x55ca39148ea0 <col:9> 'int' lvalue Var 0x55ca39148de0 'x' 'int' `-IntegerLiteral 0x55ca39148ec0 <col:13> 'int' 1
    
    void f(int n) { int x = 0; (*(&x))++; } $ clang -Xclang -ast-dump -fsyntax-only -Wall f.c TranslationUnitDecl 0x559ca0d73a58 <<invalid sloc>> <invalid sloc> `-FunctionDecl 0x559ca0ddacd0 <f.c:1:1, line:4:1> line:1:6 f 'void (int)' |-ParmVarDecl 0x559ca0ddac00 <col:8, col:12> col:12 n 'int' `-CompoundStmt 0x559ca0ddaf58 <col:15, line:4:1> |-DeclStmt 0x559ca0ddae68 <line:2:5, col:14> | `-VarDecl 0x559ca0ddade0 <col:5, col:13> col:9 used x 'int' cinit | `-IntegerLiteral 0x559ca0ddae48 <col:13> 'int' 0 `-UnaryOperator 0x559ca0ddaf40 <line:3:5, col:12> 'int' postfix '++' `-ParenExpr 0x559ca0ddaf20 <col:5, col:11> 'int' lvalue `-UnaryOperator 0x559ca0ddaf08 <col:6, col:10> 'int' lvalue prefix '*' cannot overflow `-ParenExpr 0x559ca0ddaee8 <col:7, col:10> 'int *' `-UnaryOperator 0x559ca0ddaed0 <col:8, col:9> 'int *' prefix '&' cannot overflow `-DeclRefExpr 0x559ca0ddae80 <col:9> 'int' lvalue Var 0x559ca0ddade0 'x' 'int'
    
    pervognsen
    
    Less for a simple integer on the stack, but it’s really dependent on the type and location in memory of x. x might be a deref for example (s->f), or its type might be volatile. So to be able to say it’s definitely not used, the compiler has to understand more about the nature of x, which means a more complex analysis.
    
    All of those semantic subtleties already have to be addressed in an optimizing compiler (as evidenced from the -O1 disassembly on these examples) to perform sound optimizations. I agree about the increased complexity; the compile-time overhead on -O0 (which people expect to be fast) of always having this on would be one argument against having it on by default, but that argument is less convincing for clang-analyzer (and I checked and clang-analyzer does have some kind of liveness analysis). In Clang’s case I think the more likely explanation is that it didn’t have an IR (ClangIR) until very recently, so it had to do everything with the AST. Despite Clang’s tight coupling to LLVM, using LLVM’s IR to drive this kind of analysis wouldn’t be reasonable.
    
    robn
    
    Sure. I was just pushing back on “these are semantically equivalent”. They aren’t in all cases, and if you’re only doing semantic analysis then you can miss them. I think that’s broadly why C compilers don’t see it.
    
    And yes, clang-analyzer (and most analyzers) will do data flow and control flow analysis, which is where dead stores are found.
    
    fanf
    
    Does Zig have something like the clang static analyser that can catch the mistake as written in the original style? Arguably it’s a bug in gcc and clang that they need a separate tool for such a simple analysis: it should be a built-in warning like it is in Rust.
    
    kristoff
    
    If the original code had been written in the same way, using const everywhere, then clang/gcc would’ve caught the same error with -Wunused-variable.
    
    It should be noted that Zig will catch the error by default when you compile Andrew’s version, no need for any optional flag.
    
    robn
    
    Starting to wish I’d never said anything…
    
    mtlynch
    
    I hope you’re not taking either of these articles as attacks on you.
    
    Your writeup was sensible and informative. It’s leading people to think more about tooling and static analysis, even if the discussion is getting a bit sarcastic.
    
    Thanks for your writeup and the work you do on OpenZFS!
    
    robn
    
    That’s very kind! Thank you for asking!
    
    No, I haven’t been taking anything too personally, and definitely not as attacks. Honestly, I’ve been very impressed with the quality of the discussion here in particular, and I’ve learned quite a bit too.
    
    You do grow a certain kind of thick skin working on OpenZFS, as there’s plenty of people around quite happy to tell you that you are stupid/unserious/an existential threat to all of open source. It’s nothing we haven’t heard before, though I admit that “not using -Wunused” is a novel new justification for it (even though (a) we do and (b) it doesn’t work here anyway). There was a bit of that on a couple of well-known tech discussion places that I do not visit as a rule because they upset me too much, so I was quite happy to close those tabs as soon as I opened them!
    
    (also a lot of LLM sales, which is like, ok maybe but also there are other ways to crack a nut besides bringing the weight of the entire moon down on it).
    
    My comment up was partly a wry eyeroll that we’re still talking about this, and also mild discomfort at something I kicked off now being used to dunk on someone else, even if it did make me chuckle a bit too :) That’s ok though, we all contain multitudes or something.
    
    Anyway, back to the work. Cheers :)
    
    Vaelatern
    
    The one thing I’m learning from it all is a consistent pattern of const vars on my heap is probably a good thing.
    
    cve
    
    This is something that bothered me a lot when I started to learn Rust initially, but the more I got into the language, the more I started to appreciate to have const by default and mutability only as an opt-in.
    
    worr
    
    I’m honestly a bit annoyed at both this and the Zig post. The author of the original article had mentioned that finding this in C is doable, but both in the article and the comments, the author mentioned that because this is an old code base, it requires a substantial amount of time to knock down the linting error backlog. I don’t understand the point of smugly pointing out how your favorite tool catches an unused var.
    
    moltonel
    
    IMHO the Zig article was the annoying one, for the reason you mention and for “cheating” by transforming the code just enough to let the Zig toolchain find the bug.
    
    This article is just poking fun at the Zig article (even reusing its layout), not presuming to be a counterpoint to the orihinal article. Cheaky, but deserved.
    
    robn
    
    Thanks for getting it. It’s six days and I’m still answering emails… 😬
    
    orib
    
    Similarly, when trying the original code using the Plan 9 C compilers:
    
    % pcc -c -FTw test.c warning: /tmp/test.c:29[stdin:173] set and not used: cols warning: /tmp/test.c:38[stdin:182] set and not used: psize
    
    andrewrk
    
    Neat!
    
    alper
    
    we could describe those two types of sizes as being separate distinct things
    
    struct PhysicalSize(u64);
    
    Maybe just typing out the name would also have helped? Screens are wider these days and lots of people can touch type so I don’t really see the point anymore of psize. It just looks lazy to me.
    
    rbuchberger
    
    Yeah, longer variable names would have prevented this. Humans don’t parse characters with silicon, we read words with meat. Maybe cryptic abbreviations in “must not fail” code are best avoided?
    
    robn
    
    Maybe, maybe not. I get that they look like cryptic abbreviations, but they’re also terms of art that you have to know to work in this codebase. Does that make them more or less likely to be glossed over? I for one suspect that if I saw allocated_size spelled out longform I would still skim past, but maybe thinking “oh yeah, that’s the new name for asize that robn forced us to change to”.
    
    rbuchberger
    
    Fair enough, If it’s used everywhere and the whole team understands it I can see the benefit of abbreviating it. Are there short names that differ by more than one character? If you have superficially similar entities in your codebase with suble, but crucial differences, maybe you could make the names (and the type definitions if possible) more obviously different.
    
    I have an aviation background and I agree emphatically with your assertion that “just get good” is not a reasonable answer. If a competent aircrew has a disaster or a near miss, “pilot error” is never the whole cause. There’s a deeper issue somewhere that needs mitigation. Any system that involves humans but can’t tolerate human mistakes is guaranteed to fail.
    
    david_chisnall
    
    Fair enough, If it’s used everywhere and the whole team understands it I can see the benefit of abbreviating it
    
    There are three implicit assumptions here:
    
    Everyone on the team who may touch this code is familiar with that specific part.
    
    Everyone who may touch this code is a member of ‘the whole team’ and you don’t have outside contributors or ever need to onboard new people.
    
    Humans never make one-character typos when looking at one word and typing an almost identical one.
    
    If all of these are true, I can see the benefit of using abbreviations. The fact that I believe none of them to be true is why the coding conventions for CHERIoT explicitly disallow abbreviations. People with different backgrounds often interpret them differently. We’ve had cases where people with a security, systems, or hardware background all have different preconceived notions of what an acronym means.
    
    On top of that, longer names tend to make it easier to have good autocomplete (and even vim has good C/C++ autocomplete with clangd these days). Especially if you use the prefix for context, so typing the prefix will find all of the things in a given context quickly.
    
    caleb
    
    People with different backgrounds often interpret them differently. We’ve had cases where people with a security, systems, or hardware background all have different preconceived notions of what an acronym means.
    
    On the other hand, a name consisting of full English words rarely captures the semantics of the variable that it refers to. People still have different preconceived notions, but a longer name can make people more likely to trust their preconceived notions.
    
    robn
    
    I have an aviation background and I agree emphatically with your assertion that “just get good” is not a reasonable answer. If a competent aircrew has a disaster or a near miss, “pilot error” is never the whole cause. There’s a deeper issue somewhere that needs mitigation. Any system that involves humans but can’t tolerate human mistakes is guaranteed to fail.
    
    I wasn’t sure last night when I wrote the post what my point was, but over the course of the day I’ve realised that what I was trying to get at was exactly what you’ve just described here. Thanks :)
    
    masklinn
    
    You probably know already but “they write the right stuff” is an essay about this sort of mindset in software engineering, it talks about the practices of the shuttle software group, and the blame and feedback cycle over “the process”.
    
    While it’s likely way too much for the average software project, one of the things I found interesting is the feedback cycle over basically any bug which makes it through the process, to try and see if the process can be improved to catch that bug in the future. Anything which makes it through is considered a systemic issue.
    
    robn
    
    Thanks, it’s been a long time since I’ve thought about that essay.
    
    When I used to work as an SRE, we had a kind of soft rule that any time something new or weird came up, if for whatever reason we couldn’t understand what it was and why it happened and fix it properly, we should never just shrug and say “oh well, that was weird, whatcha gonna do”. We should do something - record some logs, screenshot a load graph, write a monitor to bring some oddly high metric to the fore, write a script to correlate a sequence of interesting events. The idea is that we don’t want a repeat performance to be the “same” issue - we want to be able to use it to build our understanding of the previous time. The worst thing is a problem occurs a second time and adds nothing to your understanding of the first time it happened.
    
    It’s not quite the same, but I still try to apply that principle now while writing code: do one thing that will help the next time it comes up. In this case it’s been to mess with tools a little, but also to write something about it. Generating some discussion wasn’t quite part of my plan, but that’s going to help too (not least, some small ideas for getting more static analysis into our regular build cycle). “The journey of a thousand leagues” etc.
    
    spc476
    
    I have an aviation background
    
    About 15 years ago or so, a friend of mine was helping to replace an instrument on a Cessna airplane and was amazed that you were only allowed to use one type of wire, which only came in one color. There was no red wires for hot, black for ground, or other colors for signals. Just white. It made the job much more difficult, but that was just how it was in aviation it seems.
    
    tonyarkles
    
    Heh, that wire is expensive and fantastic. Having spools of it in different colours would be eye-wateringly expensive. One of the cool things though is the white insulation usually has a titanium oxide in it and can be really cleanly and permanently laser-labelled directly into the insulation for the whole length of the wire. It’s pretty beautiful but I definitely know that feeling of shock and awe the first time you see an all-white wiring harness :D
    
    spc476
    
    I asked about different colors, and I was told that no, it had to be white—different colors were not allowed. At the time, I suspected it was because no one wanted to pay for alternative colored wire to be certified.
    
    habibalamin
    
    I’m sure there are psycholinguistic studies on how much likelier it is to confuse two things that look and/or sound very similar. Maybe we should look it up and can make this thread more scientific, but I don’t think it’s controversial to say that similar looking things are easier for the brain to mix up.
    
    I don’t think @rbuchberger is suggesting that your reflexive but conscious behaviour would change, rather that the subconscious would simply have more redundancy and your brain would just be less likely to mix them up, despite expending the same brainpower (skimming).
    
    ocramz
    
    This, and the compounding effect of cognitive fatigue induced by staring at a screen the whole day.
    
    meithecatte
    
    FWIW, you don’t necessarily have to go all the way to allocated_size. Maybe alloc_sz and phys_sz, though?
    
    PuercoPop
    
    The issue is not the length of the names. The problem is that the two words are written mostly the same. At least the difference is at the start of the word. Otherwise it would be even harder for a human to distinguish between the two.
    
    The words in question are psize and asize. They are 5 characters long each. 4 are the same and 1 is different. The ‘signal-to-noise’ ratio is dismal. The names emphasize what the things have in common, they are sizes. Instead of emphasizing how they differ, physical vs allocated. Note that even with ‘descriptive’ names one can end up with words that are mostly the name.
    
    Following a similar style physsz and allocsz would be an improvement imho. For those that prefer shorter names, the following has some wise remarks https://nsl.com/papers/style.pdf.
    
    intelfx
    
    Following a similar style physsz and allocsz would be an improvement
    
    I think what many commenters here are missing is that the words “asize”, “lsize” and “psize” are terms of art in ZFS.
    
    Changing them to “physsz” and “allocsz” (in such a way that doesn’t introduce an impedance mismatch between this code and what the rest of the ZFS community speaks) would entail rewriting all of the codebase, all of the documentation and lore, and all of the ZFS operators’ collective knowledge.
    
    PuercoPop
    
    I think what many commenters here are missing is that the words “asize”, “lsize” and “psize” are terms of art in ZFS.
    
    We do understand they are technical jargon. And I do understand the importance of using the same language (“ubiquitous language” for buzzword compliance). Other FS related areas use similarly bad names, like atime, mtime, etc. I still think it is a bad name.
    
    I understand that changing existing practice/conventions is a huge life, which is why it is seldomly done. Even outside the context of software development. (ej. We still say that current flows in the direction opposite to electrons because the convention was established before we knew what electrons where).
    
    The goal of my first comment was to argue that it was a bad name not because it was short, but because of they chose words that are mostly written the same. This is something that happens a lot more with longer variable names. And it still hampers readability.
    
    robn
    
    Following a similar style physsz and allocsz would be an improvement imho.
    
    All other things being equal, I strongly agree.
    
    srtcd424
    
    One size doesn’t fit at all, sadly. My ND brain actually struggles with anything that’s too verbose. It still seems mad that in 2025 we’re still manipulating code as text and not structured data that would allow people to work with it according to their own needs/preferences (eg aliasing identifiers to be longer or shorter.)
    
    rbuchberger
    
    Oh I completely agree - overly long names are just as unreadable as overly short ones. In that case I think the solution is to narrow the surrounding scope/context so the name itself doesn’t have to carry all the specificity.
    
    robn
    
    I don’t know if it would make it more or less likely to happen. It’s certainly not obviously better, I think.
    
    Those three size types are used heavily throughout the codebase, as well as baked into the on-disk format - they’re fundamental to understanding how everything works. So we’d be changing it everywhere, so it would end up looking the same as everything else, which is pretty much perfect conditions for the brain glossing over it.
    
    Which really goes back to the point I was trying to make, which is that humans are bad at noticing details a lot of the time, even when they’re trying really hard to. We can shuffle things around a little bit, maybe, but I’ve never seen anything to suggest that we can ever overcome it, and our collective body of work seems to back that up.
    
    alper
    
    It’s obviously better to write things out. People’s brains can then do with that as they please.
    
    robn
    
    You are of course free to claim that, but it is not obvious to me.
    
    andyc
    
    Isn’t this bug trivially caught by unit testing?
    
    The bug is that it simply returns one of its arguments, instead of doing any calculation. That will make basically any reasonable test fail.
    
    I think the bigger problem is that there is not a culture of testing in C.
    
    The type system solution may work sometimes, but it doesn’t catch the case where the answer if off by 1, off by a power of 2, overflows, underflows, or an infinite number of other possible bugs.
    
    Write unit tests!! Especially for pure functions. They are so easy to test.
    
    robn
    
    It’s testable, but I don’t think it’s trivial.
    
    On the function itself, it’s not actually pure. In general, asize and psize are not correlated at all - psize is an amount of data, but asize is basically opaque to the specific driver in use. The units are not convertible at all; in fact, asize doesn’t even have to be a “size” - it could be an identifier or a handle for something that the vdev uses to get that data back later.
    
    This function I think is a good example of why things in C can be challenging to test. In this case, the result is extremely dependent on the overall vdev state (number, size and layout of disks) and also the history of that state (txg can basically be thought of as a timestamp). To test it thoroughly, you effectively have to mock the vdev_t, and doing that is complicated because layouts are dependent on all sorts of weird things, and it’s not uncommon to end up with a pretty long string of dependencies required.
    
    And we do do this in some places. Consider the raidz_test tool, which is basically a unit test for the raidz math. It works by building up different layout “maps” and then blasting data into it and making sure both the columns and the data within them are generated correctly. But, since in the real world the “map” is entirely based on the incoming user IO, it builds up fake IO objects as inputs to the map generation function.
    
    I think some of those things point towards API mis-design, but some of that is just C things too. Structs are very porous, and the language really nudges you towards passing whole structs around rather than just the values you need inside them. Sometimes its hard to change that too - in some contexts, it can have a measurable effect on performance (copies, register spills) or can blow out your stack. Sometimes it’s not (here’s one I did last week).
    
    And there’s another dozen annoyances, obstacles and other stuff that we’d need to clear first, but I’ve yapped enough. Ultimately, it’s all doable, and we should, and I’m slowly picking things off as I can. It’s just a long way off.
    
    andyc
    
    That is all fair … but I would note that I have basically never seen a Rust project WITHOUT unit tests.
    
    There is a strong culture of testing there, so if you were to switch to Rust, I imagine the testing would catch a lot of your bugs.
    
    It’s definitely not “abandon C” or “rewrite it in Rust”; those aren’t practical.
    
    I also think there’s a bit of a blind spot in the way Rust is sold to C programmers
    
    So yes, the Rust type system is something you can’t get without rewriting all your code in a different language, which as you say isn’t practical in every case.
    
    But you CAN get the testing in C, and IMO that is more than half the story!
    
    I think the main inhibition in C projects is usually the build system – it should be set up to build and run multiple test binaries (in parallel), and it should also have ASAN, UBSAN variants, and more
    
    I wrote my own little build system on top of Ninja to accomplish that for my C++ code. CMake definitely has support for tests, but I honestly don’t remember if it needs to rebuild for ABSAN/UBSAN, which is my pet peeve (e.g. variant builds should not invalidate each other )
    
    I also run my tests in various preprocessor modes like -D GC_EVERY_ALLOC, which is hugely helpful. This also seems inhibited by typical build systems. (I’m sure it is possible, but IME most C projects that use makefiles have a lot of friction around these things)
    
    (And granted, I am talking about user space code, not kernel code. But I imagine that the same thing applies in Rust – you run your tests as normal user space binaries, and you can also do that in C, with enough effort.)
    
    spc476
    
    The function in question is qualified as static, which means any tests have to reside in the same file, which would have to be global in nature. Or the function would have to be global, which defeats the purpose of using static to hide an implementation detail. I would contend that with C, “unit tests” are not as easy to write as in other languages, due to the nature of C compilation and linking.
    
    akavel
    
    Wouldn’t the following work?
    
    /* file: foobar.c */ static uint64_t vdev_raidz_asize_to_psize(vdev_t *vd, uint64_t asize, uint64_t txg) { ... } #ifdef WE_R_DOIN_UNIT_TESTIN #include "foobar_utests.c" #endif
    
    mcherm
    
    Have you worked on (or seen) a C project that used this approach?
    
    fanf
    
    Yes, although reversed. There are several tests in BIND where the test program #includes the library implementation. For example
    
    https://gitlab.isc.org/isc-projects/bind9/-/blob/main/tests/isc/hashmap_test.c
    
    andyc
    
    That’s a thing that can easily be changed. I compare these two things:
    
    don’t use static, and write unit tests that catch bugs
    
    use static, and have bugs that could have been caught
    
    For me the choice is extremely easy.
    
    It is a bit of a tax, but ultimately I also view C as only one part of the system I’m working on, and the other parts are written in languages that are easier to test (which is not the case for this kernel code, admittedly).
    
    I think the bigger inhibition in C projects is the build system, which is not always set up for multiple test binaries.
    
    wezm
    
    Would wrapper types like this avoid the bug in C too?
    
    typedef struct { uint64_t size } asize; typedef struct { uint64_t size } psize;
    
    robn
    
    Maybe, at the cost of a good amount of readability, and still needing some programmer discipline.
    
    The problem is that they’re now compound types, not scalar, so they have different rules for conversions, layout, etc.
    
    Say we have:
    
    typedef struct { uint64_t a_size; } asize_t; typedef struct { uint64_t p_size; } psize_t;
    
    Initialising them becomes awkward:
    
    psize_t psize = 4096; // error: invalid initializer psize_t psize = (psize_t)4096; // error: conversion to non-scalar type requested psize_t psize = { 4096 }; // ok.
    
    And we can’t initialise them on the fly (initialisers are not expressions):
    
    psize_t asize_to_psize(asize_t a); psize_t psize = asize_to_psize({ 8192 }); // error: expected expression before ‘{’ token
    
    Or return them either:
    
    psize_t asize_to_psize(asize_t a) { return { 8192 }; // error: expected expression before ‘{’ token }
    
    They’re hard to work with too, since we no longer get math ops:
    
    asize_t asize = { 8192 }; psize_t psize = { asize << 1 }; // error: invalid operands to binary << (have ‘asize_t’ and ‘int’)
    
    asize_t asize1 = { 8192 }; asize_t asize2 = { 4096 }; asize_t total = asize1 + asize2; // error: invalid operands to binary << (have ‘asize_t’ and ‘int’) return (asize1 == asize2); // error: invalid operands to binary == (have ‘asize_t’ and ‘asize_t’)
    
    Still, they can’t be accidentally used in place of each other:
    
    asize_t asize = { 8192 }; psize_t psize = { asize.a_size << 1 }; return (asize == psize); // error: invalid operands to binary == (have ‘asize_t’ and ‘psize_t’)
    
    psize_t asize_to_psize(asize_t a) { psize_t p = { a.a_size >> 1 }; return a; // error: incompatible types when returning type ‘asize_t’ but ‘psize_t’ was expected }
    
    In reality though, there’d be macros to help with all this, but they are difficult to write and maintain, and still require discipline to use. Would it be worth it to avoid this shape of bug? Hard to say. Maybe, but in deciding these things I imagine myself trying to persuade my colleagues that this is better, and right now I don’t feel great about my odds.
    
    masklinn
    
    FWIW it’s pretty heinous but in an expression context
    
    (psize_t) { 8192 }
    
    does what you need and is not a cast despite the cast-like syntax, it’s the “compound literal” syntax
    
    robn
    
    Oh thanks, I thought there was a way but I couldn’t find it. Yeah, horrid 😰
    
    bondolo
    
    Even a plain ‘typedef’ of ‘typedef uint64_t asize’ . I have worked on projects where no raw types were allowed. It really helped with good static analysis tools (FlexeLint) that would then complain whenever you mixed types.
    
    masklinn
    
    A typedef is just an alias, as far as the compiler is concerned they’re the same thing. So the only thing typedef would offer is a hint to the developer, hint which was already in the names and bith the author and reviewer missed.
    
    You can take the original code, add the missing bits so it compiles, and replace the types of psize and asize by your typedefs, and it will still compile just fine with no warning (even at Wall).
    
    It really helped with good static analysis tools (FlexeLint) that would then complain whenever you mixed types.
    
    So now you need additional tooling which specifically decides to treat typedefs as newtypes.
    
    robn
    
    Unfortunately in this case, typedefs are merely alternate names for an underlying type. typedef uint64_t asize and typedef uint64_t psize just get you two more ways to spell uint64_t. Hell, depending on your environment, uint64_t is probably itself a typedef for unsigned long long or something.
    
    masklinn
    
    Yes, however due to C’s lack of conveniences around those this can get quite unwieldy (in this case it’s fine).
    
    masklinn
    
    Here even without any sort of types shenanigans a compiler (or linter) able to warn about unused stores would tell you about psize (and cols) being dead stores.
    
    And that’s one of the reasons I find some langage’s refusal to compile on unused variables so frustrating: they’ll bust your balls about that during dev but they’ll completely miss unused stores even though every issue which can be signaled from an unused variable is as or more likely to be signaled by an unused store.
    
    robn
    
    For real. The common C static analysers usually report it ok. Do you know any more about why C compilers generally don’t? I’m assuming either its actually expensive to work out, or its not reliable enough. I would turn that warning on instantly and deal with any false positives, but maybe there’s a better reason?
    
    masklinn
    
    I have no idea. I tried checking clang’s but tracker but found nothing for diagnostics, just dead store elimination.
    
    Might just be that nobody has really proposed (/ put the effort in) it. After all they do have “unreliable” diagnostics in extra/everything so even if dead stores was not a 100% that should not be a blocker.
    
    rikthevik
    
    This is exactly the kind of bug that terrifies me as well.
    
    For example, how about a primary key on a Django model. Is an Employee.id an int? Sorta, but not really. It’s an EmployeeID, and that’s an important distinction.
    
    Which operations should be defined on it? Surely not all of the int operations. It doesn’t make sense to add two EmployeeIDs together. Equality against other EmployeeIDs? Yep. Comparing an EmployeeID to a ProductID should be undefined. And you could make the case that gt/lt and sorting is an implementation detail. But when everything an int, we lose all of that meaning.
    
    I’m sure people have more enlightened thoughts in this area. :)
    
    wezm
    
    You can do the same thing as described in the post. E.g. in one of my projects I have separate newtypes for all the primary keys:
    
    #[derive(sqlx::Type)] #[sqlx(transparent)] pub struct BookmarkId(i64); #[derive(Debug, FromRow)] pub struct Bookmark { pub id: BookmarkId, pub user_id: UserId, pub url: String, pub title: String, pub author: Option<String>, pub notes: Option<String>, pub read: bool, pub public: bool, pub created_at: Timestamp, pub updated_at: Timestamp, }
    
    sqlx(transparent) makes it so that it can be used directly in queries without having to get the inner value out, but in the Rust code all the ids are a separate type.
    
    masklinn
    
    And you don’t even need high level libraries for it, raw rusqlite will do it fine, you just have to implement a few traits by hand (or macro) and you’re good.
    
    Typed ids is a nice pattern (if just a subset of solving primitive obsession).
    
    junon
    
    You can also do #[repr(transparent)] to give it the same memory layout as the single tuple inner type.
    
    wbolster
    
    python has typing.NewType that gets you partially there at the type checker level (e.g. mypy). it prevents passing a EmployeeID to a function accepting a ProductID, and it requires these ids to be constructed explicitly:
    
    EmployeeID = typing.NewType("EmployeeID", int) ProductID = typing.NewType("ProductID", int) foo = EmployeeID(12)
    
    …but it’s not a panacea, b/c the values will behave as regular integers at runtime, e.g. they compare equal and can be mixed in arithmetic operations:
    
    $ python >>> EmployeeID(123) == ProductID(123) True >>> EmployeeID(1) + ProductID(2) 3
    
    matklad
    
    What’s up with the first cols assignment? It’s unused, no?
    
    robn
    
    Good spot :) Yeah, another thing a compiler won’t tell you about, though it will quietly optimise it away, so it’s no big deal. Unfortunately there’s loads of this stuff throughout the OpenZFS codebase like this. Static analysis finds a good chunk of it (including that one), but it takes time to go through and clean up, and alas we’re very short on time.
    
    As it happens though, when I have a few spare minutes and the mood takes, I do sometimes take a swing at tidying up things like this. Current pile of patches is robn/zfs#tidy-your-room, and this particular one is already sorted in robn/zfs@288782ae01.
    
    matklad
    
    so it’s no big deal
    
    The problem isn’t performance, the problem is that this likely points towards a bug introduced during a refactor.
    
    robn
    
    In this case, the behaviour was changed a little, but the initialiser wasn’t. There is no bug there, just untidyness. And it’s true that untidyness makes it harder to spot new messes, so I definitely agree its a deal, but then we’re back to programmer time being scarce. We do what we can!
    
    gregnavis
    
    Unfortunately there’s loads of this stuff throughout the OpenZFS codebase like this
    
    That surprised me. Why is that?
    
    robn
    
    Mostly, it’s old, there’s limited time available for general maintenance and this stuff rarely gets in the way.
    
    spc476
    
    Curious. I can use -Wall -Wextra in GCC/clang and get an unused warning [1], but I I wonder if adding those options now (if they aren’t being used) will generate too much noise (I always compile C code with them, plus some extra ones that aren’t covered by those two).
    
    [1] I even added such a warning to an assembler I wrote, so it’s not exactly rocket science these days.
    
    masklinn
    
    I tried it on godbolt, neither GCC (15.1, Wall, Wextra) nor clang (20.1, Wall, Weverything) flag an unused store. And since psize is updated in place it has at least one read which satisfies both compilers (also works on Go, and I assume Zig).
    
    robn
    
    If it’s a single-step transform, -Wunused-variable will catch it:
    
    uint64_t asize_to_psize(uint64_t asize) { uint64_t psize = asize << 1; return asize; } $ gcc -Wall -Wextra -c size.c size.c: In function ‘asize_to_psize’: size.c:5:14: warning: unused variable ‘psize’ [-Wunused-variable] 5 | uint64_t psize = asize << 1; | ^~~~~
    
    If it has multiple steps, it will not:
    
    uint64_t asize_to_psize(uint64_t asize) { uint64_t psize = asize << 1; psize <<= 1; psize *= asize + 3; return asize; } $ gcc -Wall -Wextra -c size.c $
    
    (fwiw, gcc version 12.2.0 (Debian 12.2.0-14+deb12u1)).
    
    That’s correct behaviour by the documentaion, which specifically says “unused aside from its declaration”. My (rough) understanding is that its “stored” (in a register or whatever), it’s no longer “unused” by this definition.
    
    masklinn
    
    I don’t think “stored” really matters, just that the variable itself is read once at the langage level (the entire thing will be SSA’d long before it gets lowered to assembly after all). Any read of the variable will satisfy the “variable is used” predicate and suppress the warning (/ error in go and zig, in the latter the langage server will even update your code with _ = var to avoid the compiler complaining, an exercise in insanity if I’ve ever seen once).
    
    masklinn
    
    Yeah, another thing a compiler won’t tell you about
    
    Some compilers do, fwiw (though maybe not C compilers).
    
    robn
    
    Well, C is implied here :)
    
    masklinn
    
    It’s making me think tho: rustc does flag it, c2rust exists, and even though the code it produces has its issues maybe it surfaces this issue? Because it’s not part of the C semantics so c2rust should have no reason to work around it.
    
    edit: it kinda works but c2rust allows dead_code and unused_assignments by default, so you need to remove those.
    
    altano
    
    I had the same question
    
    andrewrk
    
    haha! Despite being roasted, I’m happy to see this post-
    
    When I worked at Backtrace, I attempted to write all my C code this way, but the CTO insisted on putting all the locals at the top of a function as variables. This really frustrated me. So it pleases me to see the culture shifting with posts like this.
    
    However… I have to call out the post a little bit for not showing the command used to compile the C code, because by default, this is not caught:
    
    [nix-shell:~/tmp]$ clang -c bug.c [nix-shell:~/tmp]$ gcc -c bug.c [nix-shell:~/tmp]$ clang --version clang version 20.1.5 Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /nix/store/kfav6wkyglcpgrajyfq0zdgfqlm3cwky-clang-20.1.5/bin [nix-shell:~/tmp]$ gcc --version gcc (GCC) 14.2.1 20250322 Copyright (C) 2024 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    mtlynch
    
    I have to call out the post a little bit for not showing the command used to compile the C code
    
    He sort of does because the flag is in the warning output, but I agree it could be clearer (and more true to mirroring your post):
    
    t.c:30:12: warning: unused variable 'psize' [-Wunused-variable] ^^^^^^^^^^^^^^^^^
    
    I can reproduce it with gcc:
    
    $ gcc -c -Wunused-variable t.c t.c: In function ‘vdev_raidz_asize_to_psize’: t.c:30:15: warning: unused variable ‘psize’ [-Wunused-variable] 30 | const u64 psize = parity_adjusted << ashift; | ^~~~~ $ gcc --version | head -n 1 gcc (GCC) 14.3.0
    
    andrewrk
    
    Oh that’s also a great point. It was a warning only, returning exit code 0. So in a larger project this would have very likely scrolled by with nobody noticing it.
    
    robn
    
    Every large C codebase I’ve worked on uses -Werror, at least where it existed (introduced in .. GCC 4 maybe?). If OpenZFS hadn’t, enabling it and dealing with the fallout would have been one of the first things I did when I arrived.
    
    I am always wary when I see even a single warning when compiling that doesn’t stop the build. It’s not always the project’s fault; build options, compiler versions, libc differences, all sorts of things can change something for the compiler. But to not have -Werror on? If it’s anything important, I go and find out why, in case it speaks to some deeper issue with the development process that I might not want to bet on.
    
    quad
    
    Assuming said project doesn’t use -Werror
    
    mtlynch
    
    Spoiler: The bug would be easier to catch in Zig because the Zig compiler flags unused variables.
    
    I’m surprised that OpenZFS happens without static analysis for unused variables, as I thought that was a pretty easy thing to check for. I wonder how many other bugs fall out if you run more static analysis on the ZFS codebase.
    
    masklinn
    
    Except the bug would not have been caught in Zig without writing the code differently, because the variables were not unused.
    
    And had the code been written differently, in such a way that Zig would have caught it, the C compilers could have caught it as well. The original post even mentions -Wunused.
    
    kornel
    
    Compilers don’t have to track unused variables, they can track unused assignments to the variables:
    
    warning: value assigned to `psize` is never read | | psize <<= ashift; | ^^^^^ | = help: maybe it is overwritten before being read? = note: `#[warn(unused_assignments)]` on by default
    
    masklinn
    
    Yes I know that, I was in the original thread where this sort of things was being discussed. However that is not what the most widespread C compilers (GCC and Clang) do, nor is it what the Zig compiler does.
    
    landon
    
    The fact that I can’t turn this off is why I don’t like writing zig programs.
    
    The fact that other people can is why I dont like running C programs
    
    danlamanna
    
    I just started toying with zig recently and it’s really such an enormous pain. I have no idea why debug builds have to deal with this nonsense - it makes development iterations so frustrating.
    
    Vaelatern
    
    It’s fascinating – do you find rust pleasant?
    
    I’ve found the few times I’ve written rust extremely painful, but enjoyed writing zig a lot.
    
    jmiven
    
    Should be merged with r5omhr :-)
    
    landon
    
    I too was confused as to why the compiler let the original fly
    
    lilyball
    
    This article says
    
    There’s no “cheap” annotation (eg -Wunused) that will notice it.
    
    and goes on to suggest that in Rust you might catch this by using newtyped return values. Well, if you port this function exactly as-is to Rust, without doing any type annotations or adding any compiler flags, Rust will already catch this, saying warning: value assigned to psize is never read. Similarly it will also point out the fact that cols is written to twice in a row. I really wonder why C compilers don’t do that out of the box.
    
    david_chisnall
    
    This is precisely the kind of bug that makes me use C++ instead of C for any systems programming. Not being able to differentiate different quantities is just terrifying for a language used for the most privileged parts of the system.
    
    ibookstein
    
    In C++ I find myself using enum classes with an explicit underlying type for ad-hoc newtype integers. When interop with C is required, the single field struct mentioned above is my poison of choice. I rarely promote them to full-fledged classes with methods and conversions.
    
    robn
    
    You aren’t wrong. I wonder if it’s worth me looking back towards C++ a bit. I wrote quite a lot of it in 2009-2012. I don’t know that I got good at it, but I liked it ok (and without that experience probably wouldn’t have appreciated what Rust was trying to do when it came along). Might be a path of someone less resistance for getting a little more rigor into some of the bits of OpenZFS that could really benefit from it.
    
    david_chisnall
    
    There’s a much easier migration path to C++ from a codebase that is C, since there is a large common subset of C and C++ that makes it easy to incrementally rewrite C as C++. For completely new projects, I think Rust might be a better option. Making Rust work with ownership models that were not originally designed with Rust in mind seems to be challenging but that’s not an issue for clean-slate codebases.
    
    I did hit some issues using C++ in kernel modules for FreeBSD because FreeBSD kernel modules (.ko files) are really just .o files: they don’t get a proper link step and so COMDATs are not resolved, which meant that there were then things left in them that the kernel loader didn’t know what to do with. At some point I want to fix this, but it’s probably a blocker for a lot of C++ constructs in OpenZFS (inline functions, templates).
    
    makishimu
    
    I have to explicitly call this out. This “a one-character fix” is a stellar example of how to write a PR/commit message. Kudos!
    
    robn
    
    Thanks. I try hard on those :)
    
    danlamanna
    
    Do you feel any type of way about the response? Unless there was communication in some other channel, it just looks like an “approved” with no comment. This is the sort of thing I would expect a bit of a postmortem after, but maybe not.
    
    robn
    
    We did talk about it a little internally at Klara, and the original author of that PR (and my colleague) did use the opportunity to run a new set of tests based on my settings (the “aggressive allocator fragmentation” I mentioned in the post, and found some math bugs too (#17490. So in that sense, we took the opportunity to double-check.
    
    In a proper (and for-profit!) product organisation, I think it’d be worth a pause to check if we need to update our processes, or down tools and build out some new test infrastructure, or whatever. In a volunteer- & contract-driven open-source project though, that sort of time and space just doesn’t exist.
    
    So yes, I do sigh a little bit, because these are solved problems in the abstract. But no, I think the response is fine, because there really isn’t much to say. There’s nothing new here that we didn’t already know. We know C is tricky, and we know our testing regime is not as good as we would like. But, it is pretty good, we improve it a little but all the time, and we try to build “general maintenance and targets of opportunity” into everything we do. Sometimes you just have to play the long game.
    
    adavis
    
    I have vivid memories of being in calls back in the day, and you spending more time on a commit message than the change itself.
    
    It really taught me how valuable a commit message can be, and to be kind to your future self when all the context has fallen out of your brain.
    
    alterae
    
    speaking of bugs, their code blocks have a common one (lines in the same code block are different font sizes from one another)
    
    i see this on a lot of people’s blogs, specifically on webkit on ios, and i have no idea why
    
    robn
    
    Curious. This is just whatever Hugo produces with it’s default code highlighter, so I’m guessing that might be why you see it around more often. I’ve never seen it before, but I’m not a frequent iOS user. I’ll borrow a phone off the kid and give it a try, and see if there’s anything obvious I can do about it. Thanks for the heads up!
    
    pekkavaa
    
    It happened on my website and I generate HTML from markdown via Pandoc. Can’t remember how I figured out the fix but I found this from my style.css:
    
    /* Pandoc's default value of display:inline-block shows up as too large on iOS Safari. */ pre > code.sourceCode > span { display:inline; }
    
    jackdk
    
    I was really hoping for something akin to Haskell’s newtype: a type that is representationally the same as some other type, has no runtime overhead, but is nominally distinct. Does Zig have such an idiom? It seems that this might not be possible, as a consequence of “all structs are anonymous” (as the language reference puts it).
    
    masklinn
    
    You just create a named struct with a single member. Or an enum wrapper (https://ziglang.org/devlog/2024/#2024-11-04).
    
    andrewrk
    
    Every struct and enum in Zig makes a new type.
    
    jackdk
    
    Cool, thanks.
    
    BTW, that devlog page says anonymous struct types have been removed. Does that mean the statement “all structs are anonymous” at https://ziglang.org/documentation/master/#Struct-Naming is no longer correct?
    
    andrewrk
    
    No.
    
    I see the confusion here. There used to be an additional weird type in the type system that isn’t even worth explaining, but it’s gone now.
    
    dbremner
    
    @pushcx
    This should be merged into the original along with these two others:
    
    OpenZFS Bug Ported to Zig
    
    OpenZFS Bug Ported to C
    
    makishimu
    
    Please help me understand: why two distinct size types would be desirable? (Rather in general than in this case to avoid this bug.) I can see how tooling that detects unused assignment can help. I can see how having distinct handle types over void * is good. I can see how wrapping a DB primary key into a separate type is useful (as mentioned in the comments). And, of course, I can see how having variable names with Levenshtein distance > 1 is preferable.
    
    However the distinction between physical and allocated sizes is so thin, isn’t it? Both represent an amount of bytes on a device. Both, I believe, can be reasonably summed up and compared. And yet there is an argument to have them as separate types.
    
    Similarly, bondolo in the comments mentioned working on projects where no raw types were allowed. At glance it seems somewhat overkill and dystopian to me. So what am I missing?
    
    robn
    
    However the distinction between physical and allocated sizes is so thin, isn’t it? Both represent an amount of bytes on a device. Both, I believe, can be reasonably summed up and compared. And yet there is an argument to have them as separate types.
    
    The answer is actually in your description. They do indeed both represent “an amount of bytes on a device”. But, they don’t represent an amount of bytes on the same device (or don’t have to). Or, more abstractly, they represent an amount of bytes, yes, but how to interpret them depends on the context.
    
    If we only pass around the raw “number of bytes” value (as a uint64_t), then when it comes time to interpret that value, that “interpetation context” has to come from elsewhere. Usually, that means the programmer just has to get it right. If they get it wrong, we lose the spacecraft, or feel quite uncomfortable in our 104℃ bath.
    
    But, if we can somehow combine the raw value with the context we need to interpret it and pass them around together, then our tools can use this extra knowledge to make sure we always use the value correctly.
    
    If the tool we have is types, then we can encode the context into the type, and the type-checking facility (say, the Rust compiler) can make sure we use it properly. It doesn’t have to be done with types as such; in a Perl or a Python maybe I could serialise it as a two-element array ['A', 8192], and then my “type-checker” would be something that checks that size[0] == 'A'. Or whatever, there’s tons of methods.
    
    The nice thing about lots of languages that aren’t C is you can still arrange for math operations to work on the value part, so you only need to worry about the context when it comes time to use it in a different context. My silly Rust version isn’t quite the whole thing, probably it .. implements Deref? or Add and Sub and stuff. I dunno, my Rust isn’t very strong. There’ll be a crate. It’s not the point.
    
    That was my thinking anyway.
    
    masklinn
    
    My silly Rust version isn’t quite the whole thing, probably it .. implements Deref? or Add and Sub and stuff.
    
    Implementing arithmetic operations probably makes sense if these quantities are used for computations (also Eq/Ord). Deref is less likely, especially for number, in my experience From/Into is a better way to implement “cast to primitive” if it’s needed.
    
    sknebel
    
    Both, I believe, can be reasonably summed up and compared.
    
    Can they? In a way that has a meaningful result the code regularly needs? That seems like a huge assumption to me, and I would easily believe someone actually involved in the project asserting the opposite, as the article seemingly does. If you work with both size types close to each other regularly, but almost never want to have them interact, forcing the developer to be very explicit when they do that seems like a fairly obvious measure to avoid mistakes.
    
    makishimu
    
    Now, after reading new robn’s comments that shed more light onto the specific details of fundamental differences between asize and psize, I can see why having two distinct types would be better. They are two very different amounts of bytes in the domain of OpenZFS.
    
    It’s such a shame typedef is merely an alias in C )-:
    
    On a general note (as I meant it to be), isn’t bondolo’s mentioned “no raw types allowed” approach too much? Or: when is it not too much?
    
    robn
    
    On a general note (as I meant it to be), isn’t bondolo’s mentioned “no raw types allowed” approach too much? Or: when is it not too much?
    
    If you’ve got tools that can understand them as something different, then it’d certainly be worth a look. Without that, I think typedefs to fundamental types are probably only worthwhile when they have a clear, focused meaning as something that really shouldn’t be thought of as a quantity. Anything that gets a lot of math probably shouldn’t be, because there’s other stuff the C programmer needs to think about when doing math (overflow, signedness) that gets harder if the underlying type is not visible.
    
    (I have not thought hard about this, but that’s my initial reaction anyway).
    
    mk12
    
    I agree with you. I think type wrappers for integers can be useful, but they’re very often shown in small examples to suggest they’re an obvious net win that will prevent bugs. In reality their usefulness is inversely proportional to how much casting back and forth you’ll have to do. At best that’s very little and the values really do just flow through the program with minimal interactions. Maybe that’s the case for asize, I don’t know. But at worst casts are everywhere and you’ve just made the program more verbose for nothing. If your eyes glaze over the wrong word they can also glaze over a cast to the wrong word.
    
    robn
    
    Yep, this is really important, and is why I’d be loathe to do it in C, which really doesn’t have the tools to say “I want to use this like a plain old number and just be really strict on the conversions”. If we must rely on reading to catch some kinds of bugs, then anything we do has to be more readable, not less.
    
    In any language with operator overloading. delegation, transparent dereferencing, etc? No brainer.
    
    oomkiller
    
    I wondered how well an LLM would catch this issue vs static analysis or switching to a high powered type system. It did surprisingly well. Sharing just the code snippet and asking to “Review this: ”, all the LLMs I tried immediately noticed the mistake and labeled it a critical issue.
    
    But as this was part of a larger patch, using it for a full review would cause the context to be larger leading to the response having more noise or generally just performing poorly. As mentioned, static analysis tools have a lot of false positives and this is something I’ve noticed can be an issue using LLM for reviewing code as well. So instead of the code snippet, I shared the entire patch. Fortunately, Gemini 2.5 Pro and Claude 4 Sonnet both still noticed the “Critical issue” and boldly stated it must be fixed before moving on. o4-mini-high didn’t seem to catch it, and o3 did notice the issue but put it alongside a bunch of others which may or not be real problems, so I think it flunked.
    
    These tools are evolving and while many choose not to use them for development, it might not be the worst idea to run a hand-coded patch through a quick analysis to see if it can spot an issue the 3 super experienced developers missed.
    
    jsnell
    
    This was my first thought as well, so I did the same experiment before reading the comments.
    
    You don’t even need to go to top tier models like 2.5 Pro or 4 Sonnet; even Gemini 2.5 Flash Lite spots the bug even when given the original diff with no additional context. (Yes, I turned off grounding with Google Search.)