The production bug that made me care about undefined behavior

40 points by dryya

muvlon

in some conditions, Response response; is perfectly fine. In some other cases, this is undefined behavior.

Yup, classic C++. Initialization is a huge complex mess, but somehow they still don't seem to have developed a culture of defensively initializing everything explicitly always and play these silly shorthand games instead.

At least in this case, there is maybe an out. There is a serious proposal to change the language s.t. it always zero-initializes local variables: https://isocpp.org/files/papers/P2723R0.html

And before you scream about performance, go read their report, please. They address this very well.

0x2ba22e11
And before you scream about performance, go read their report, please. They address this very well.

My intuition is:
- your code either does touch uninit variables or it doesn't
- if it doesn't, a modern compiler is likely to be able to prove that and eliminate the redundant init so there is usually no performance impact, and even if it doesn't then the redundant writes probably get combined in L1D anyway
- if it does, then that has been UB and bad performance is the least of your troubles anyway
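
A minimal sketch of the second bullet, assuming a hypothetical compute helper (none of these names come from the post). The zero written in with_defensive_init is overwritten before it is ever read, so an optimizing compiler would typically treat it as a dead store and emit the same code for both functions. That is an expectation, not a guarantee, so it is worth checking the generated assembly on your own toolchain.

```cpp
#include <cassert>

// Hypothetical stand-in for any computation that produces the real value.
static int compute(int x) { return x * 2 + 1; }

// Defensive style: the initial 0 is never read, because it is
// overwritten by the assignment on the next line.
int with_defensive_init(int x) {
    int result = 0;
    result = compute(x);
    return result;
}

// Shorthand style: no initializer. This is still well-defined because
// result is assigned before it is read.
int without_init(int x) {
    int result;
    result = compute(x);
    return result;
}
```

Both functions return the same value for every input; the only difference is whether a reader has to prove that to themselves.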
- muvlon
  Yep, your intuition is basically spot-on. There is a fun extra bit though: They found a few cases where
  
  - there is no UB (i.e. uninit is not touched),
  - the compiler was not able to figure this out and had to insert an extra xor eax, eax or similar, and
  - (the surprising part) this extra instruction made the program slightly faster!
  
  Their explanation IIUC is that zeroing out a register tells the CPU very plainly that there is no data dependency between the old value that was in there and whatever you're about to use it for. So the register renaming part of the CPU gets to assign a new microarchitectural register to that ISA register, potentially leading to better pipelining. My takeaway is that clear, simple semantics can be good not only for humans and compilers but also for the hardware.
  
  Now of course you may retort "but since it was UB before, it would have been a valid optimization to add that instruction already!", and that's correct. I'm not claiming the spec change is unlocking new optimizations, just that the research surrounding it is changing (some of) our intuitions about performance.
  - 0x2ba22e11
    
    Oh neat, thanks!
- olliej
  
  This has been accepted into C++26 (alongside erroneous behavior - an obnoxious compromise to deal with people who insist that these are errors so should not be given defined behaviour - i.e. “we would rather have massive foot guns that don’t exist in any other language than specify the behaviour in any way”).
  
  This has been shipping by default in Xcode's clang for a few years at this point.
  
  Re: weird ass C++ init rules: I just assume any POD-typed values are never initialized. Once you do that the rules are much simpler, and it’s not really that pessimistic given the actual behaviour :)
  - davmac
    
    (alongside erroneous behavior - an obnoxious compromise to deal with people who insist that these are errors so should not be given defined behaviour - i.e. “we would rather have massive foot guns that don’t exist in any other language than specify the behaviour in any way”)
    
    My understanding of erroneous behaviour is that it does cover a restricted set of behaviours which are specified, unlike UB which is "anything can happen". The "massive footguns" are therefore not so massive.
    
    For example, reading an uninitialised variable will (if it does not specifically cause a runtime error) actually yield a value. The program may not behave as the programmer intended, but that behaviour will at least be bounded.
    
    olliej
    
    IIRC it's specified as the outcome of EB must be independent of the machine state.
    
    The important thing is that while EB is obviously an error, the compilers are not permitted to then use it as proof of impossibility, which is what makes UB so incredibly pernicious.
    
    The problem with UB is not just "anything can happen when you do UB", it is "anything can happen if any UB ever happens anywhere in your program" - from the standard's point of view it's not "this operation in my code is UB" it is "the entire program is UB". The "anything can happen" outcome does not require you to actually perform the UB operation, it is the mere presence of UB that results in "anything can happen".
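
    A common illustration of the "proof of impossibility" point, as a hedged sketch (read_then_check is a hypothetical name, not from the post). The dereference comes before the null test, so a compiler is allowed to reason that p cannot be null at the test and may delete the branch. Whether a given compiler actually does so depends on the version and optimization level.

    ```cpp
    #include <cassert>

    // Defined only when p points to a valid int.
    int read_then_check(const int* p) {
        int value = *p;          // dereference happens first
        if (p == nullptr) {      // a compiler may treat this as always false
            return -1;           // and drop this branch entirely
        }
        return value;
    }
    ```

    Calling it with a valid pointer is fine and returns the pointed-to value. Calling it with nullptr is UB, and the -1 "safety net" cannot be relied on.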
    
    davmac
    
    The important thing is that while EB is obviously an error, the compilers are not permitted to then use it as proof of impossibility
    
    While pointing out this technical distinction hasn't done me any favours on this forum recently, I'd point out that compilers do not use UB as proof of impossibility. They use the necessary presence of UB on some execution path as an indication that they need not infer constraints from (including generate code for) that execution path, which is not the same as assuming the execution path cannot be reached at all. The end result is in some cases the same, but not always, because sometimes compilers choose not to drop all constraints from such paths.
    
    Anyway, yes, the "massive footgun" of UB has been removed from these particular cases that now have "erroneous behaviour" instead of UB, in contradiction to your original statement, which was my point.
    
    IIRC it's specified as the outcome of EB must be independent of the machine state.
    
    I did not claim otherwise.
    
    spc476
    
    ... reading an uninitialised variable will (if it does not specifically cause a runtime error) actually yield a value.
    
    Or it causes a trap because the value is an integer trap representation (which apparently existed on enough systems to make it into the C standard). C alone has over 200 UB causes, many of which C++ inherits.
    
    davmac
    
    Or it causes a trap
    
    That was meant to be covered by:
    
    (if it does not specifically cause a runtime error)
    
    Edit: though, on later reflection, I'm wondering if you are saying that an erroneous value can have a trap representation, meaning that using the value would still be undefined behaviour due to the representation. Interesting point, though of course primitive types on modern, real implementations do not have trap representations, with the arguable exception of bool. The C++ standard is (surprisingly, to me) a little vague about invalid representations compared to the C standard, and for example doesn't mention trap representations.
    
    Anyway, I would expect a reasonable implementation to ensure either that erroneous values with a trap representation did in fact cause a trap rather than any other behaviour, or that erroneous values never had a trap representation. I'm a bit concerned now to note that the proposal (https://isocpp.org/files/papers/P2795R5.html) specifically calls out a case where an erroneous bool value results in UB (because the representation happens to be invalid for bool). That seems a bit two-steps-forward-one-step-back.
    
    muvlon
    
    I just assume any POD-typed values are never initialized.
    
    That's a decent defensive stance, but the issue (as shown in the example from the post) is that anything might transitively contain pod types, you can't tell just by looking at Response response;.
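
    One defensive option, sketched under assumptions: the post's real Response definition is not shown in this thread, so Status and the fields below are hypothetical. Giving every primitive field a default member initializer means Response response; is safe no matter how deeply the primitive is nested.

    ```cpp
    #include <cassert>
    #include <string>

    // Hypothetical nested type; the initializers are what make it safe.
    struct Status {
        bool ok = false;
        int code = 0;
    };

    // Hypothetical stand-in for the Response type from the post.
    struct Response {
        std::string body;  // class type: default constructor yields ""
        Status status;     // nested struct: fields take their initializers
    };

    Response make_default_response() {
        Response response;  // default-initialization, yet every field is defined
        return response;
    }
    ```

    Without the = false and = 0 initializers, the same Response response; line would leave status.ok and status.code indeterminate, and nothing at the declaration site would show it.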
  - tomsmeding
    
    I think there's a much easier way to think about what a simple T x; initialisation does: it calls the default constructor of T. How is this constructor defined?
    
    - For structs/classes with a custom default constructor, it's that one.
    - For structs/classes without such, it calls the default constructors of all fields (or a value constructor if there's an initial value specified).
    - For C-style arrays, it calls the default constructors of the elements. This is exactly analogous to structs.
    - For primitive types, the default constructor does nothing.
    
    This describes the same rules as the list in the article, but contrary to that list, it's nicely inductively defined and actually makes sense. The only difference between std::string[10] and int[10] is between std::string and int; the array has nothing to do with it.
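
    The rules above can be sketched as follows; Plain and Wrapped are hypothetical types made up for illustration. The example only reads values that are guaranteed to be initialized: std::string elements after plain default-initialization, and primitives after value-initialization with {}.

    ```cpp
    #include <cassert>
    #include <string>

    struct Plain { int n; };                   // `Plain p;` would leave n indeterminate
    struct Wrapped { std::string s; int n; };  // `Wrapped w;` would construct s but not n

    bool strings_are_empty() {
        std::string names[3];  // default-initialized: each element is ""
        return names[0].empty() && names[1].empty() && names[2].empty();
    }

    int sum_value_initialized() {
        int counts[3]{};  // value-initialized: all zeros
        Plain p{};        // p.n is 0
        Wrapped w{};      // w.s is "", w.n is 0
        return counts[0] + counts[1] + counts[2] + p.n + w.n;
    }
    ```

    Dropping the {} from counts, p, or w would make the reads in sum_value_initialized touch indeterminate values, which is exactly the std::string[10] versus int[10] distinction.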
    
    Now whether this is a good idea, or whether it makes sense that computing with the value of an uninitialised integer is undefined behaviour but not for char, is a separate question entirely, and there I mostly agree with the rant in the OP. :P
    
    snej
    
    Exactly. It’s not complicated; as far as this bug is concerned, you just have to know that primitive types like int have to be explicitly initialized. Even if they’re inside a struct. That's exactly like C.
    
    Clang can generally detect when a bare variable (not in a struct) is used before initialization. I guess being inside a struct complicates the analysis too much? But there is a check in clang-tidy that warns about uninitialized fields.
    
    Also, worth noting that in C++26 using an uninitialized variable is “upgraded” from UB to Erroneous Behavior which the compiler itself can flag as an error.
    
    kornel
    
    you just have to know
    
    It's not a "just". You need to know definitions of specific types in each codebase, and that information is non-local and can change over time.
    
    It's an awful user interface. It's like having a light switch in your house that either switches the lights on or sets the house on fire, depending on whether "don't set house on fire" switch in the basement is on. It's not complicated!
    
    tomsmeding
    
    You're right that the fact that the rules are not so complicated doesn't mean that the system as a whole is a good idea, or pleasant to work in, or anything. However, I think it's still worth trying to present the rules as simply as possible: the simpler we present the rules, the easier it becomes to get intuition on the footguns, so that you can more effectively avoid them in practice.
    
    mologie
    
    Regarding the char assert example: ASan (AddressSanitizer) indeed will not catch this, because no addressing rules are violated. UBSan (UndefinedBehaviorSanitizer) on the other hand should detect that an uninitialized value is copied. Fun stuff.
    
    MaskRay
    
    MemorySanitizer should have caught this use of an uninitialized value.