The production bug that made me care about undefined behavior
40 points by dryya
in some conditions, Response response; is perfectly fine. In some other cases, this is undefined behavior.
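For anyone who hasn't read the article, here is a minimal sketch of how the same-looking declaration can go either way. The two struct definitions and their field names are made up for illustration; they are not the article's actual Response type.

```cpp
#include <cassert>
#include <string>

// Hypothetical variant 1: default member initializers, so a plain
// `ResponseWithDefaults r;` leaves every field in a known state.
struct ResponseWithDefaults {
    bool success = false;
    bool error = false;
    std::string body;
};

// Hypothetical variant 2: no initializers. After `ResponsePlain r;` the two
// bools hold indeterminate values and reading them is undefined behavior.
// The std::string member is still default-constructed to "".
struct ResponsePlain {
    bool success;
    bool error;
    std::string body;
};

// Defensive spelling: `{}` value-initializes the members, which zeroes the
// bools even though ResponsePlain declares no initializers of its own.
inline ResponsePlain make_plain() {
    ResponsePlain r{};
    return r;
}

// Only reads fields that are guaranteed to be initialized.
inline void run_demo() {
    ResponseWithDefaults a;  // fine: every member has a default
    assert(!a.success && !a.error && a.body.empty());

    ResponsePlain b = make_plain();  // fine: value-initialized via {}
    assert(!b.success && !b.error && b.body.empty());
}
```

The sketch deliberately never reads the fields of a plain `ResponsePlain r;`, since that is exactly the read that would be undefined.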
Yup, classic C++. Initialization is a huge complex mess, but somehow they still don't seem to have developed a culture of defensively initializing everything explicitly always and play these silly shorthand games instead.
At least in this case, there is maybe an out. There is a serious proposal to change the language s.t. it always zero-initializes local variables: https://isocpp.org/files/papers/P2723R0.html
And before you scream about performance, go read their report, please. They address this very well.
And before you scream about performance, go read their report, please. They address this very well.
My intuition is:
Yep, your intuition is basically spot-on. There is a fun extra bit though: They found a few cases where adding xor eax eax; or similar actually made the code faster. Their explanation IIUC is that zeroing out a register tells the CPU very plainly that there is no data dependency between the old value that was in there and whatever you're about to use it for. So the register renaming part of the CPU gets to assign a new microarchitectural register to that ISA register, potentially leading to better pipelining. My takeaway is that clear, simple semantics can be good not only for humans and compilers but also for the hardware.
Now of course you may retort "but since it was UB before, it would have been a valid optimization to add that instruction already!", and that's correct. I'm not claiming the spec change is unlocking new optimizations, just that the research surrounding it is changing (some of) our intuitions about performance.
This has been accepted into C++26 (alongside erroneous behavior - an obnoxious compromise to deal with people who insist that these are errors so should not be given defined behaviour - ie “we would rather have massive foot guns that don’t exist in any other language than specify the behaviour in any way”).
This has been shipping by default in Xcode's clang for a few years at this point.
Re: weird ass C++ init rules: I just assume any pod typed values are never initialized. Once you do that the rules are much simpler, and it’s not really that pessimistic given the actual behaviour :)
(alongside erroneous behavior - an obnoxious compromise to deal with people who insist that these are errors so should not be given defined behaviour - ie “we would rather have massive foot guns that don’t exist in any other language than specify the behaviour in any way”)
My understanding of erroneous behaviour is that it does cover a restricted set of behaviours which are specified, unlike UB which is "anything can happen". The "massive footguns" are therefore not so massive.
For example, reading an uninitialised variable will (if it does not specifically cause a runtime error) actually yield a value. The program may not behave as the programmer intended, but that behaviour will at least be bounded.
IIRC it's specified as the outcome of EB must be independent of the machine state.
The important thing is that while EB is obviously an error, the compilers are not permitted to then use it as proof of impossibility, which is what makes UB so incredibly pernicious.
The problem with UB is not just "anything can happen when you do UB", it is "anything can happen if any UB ever happens anywhere in your program" - from the standard's point of view it's not "this operation in my code is UB" it is "the entire program is UB". The "anything can happen" outcome does not require you to actually perform the UB operation, it is the mere presence of UB that results in "anything can happen".
The important thing is that while EB is obviously an error, the compilers are not permitted to then use it as proof of impossibility
While pointing out this technical distinction hasn't done me any favours on this forum recently, I'd point out that compilers do not use UB as proof of impossibility. They use the necessary presence of UB on some execution path as an indication that they need not infer constraints from (including generate code for) that execution path, which is not the same as assuming the execution path cannot be reached at all. The end result is in some cases the same, but not always, because sometimes compilers choose not to drop all constraints from such paths.
Anyway, yes, the "massive footgun" of UB has been removed from these particular cases that now have "erroneous behaviour" instead of UB, in contradiction to your original statement, which was my point.
IIRC it's specified as the outcome of EB must be independent of the machine state.
I did not claim otherwise.
... reading an uninitialised variable will (if it does not specifically cause a runtime error) actually yield a value.
Or it causes a trap because the value is an integer trap value (which apparently existed on enough systems to make it into the C standard). C alone has over 200 UB causes, many of which C++ inherits.
Or it causes a trap
That was meant to be covered by:
(if it does not specifically cause a runtime error)
Edit: though, on later reflection, I'm wondering if you are saying that an erroneous value can have a trap representation, meaning that using the value would still be undefined behaviour due to the representation. Interesting point, though of course primitive types on modern, real implementations do not have trap representations, with the arguable exception of bool. The C++ standard is (surprisingly, to me) a little vague about invalid representations compared to the C standard, and for example doesn't mention trap representations.
Anyway, I would expect a reasonable implementation to ensure either that erroneous values with a trap representation did in fact cause a trap rather than any other behaviour, or that erroneous values were never comprised by a trap representation. I'm a bit concerned now to note that the proposal (https://isocpp.org/files/papers/P2795R5.html) specifically calls out a case where an erroneous bool value results in UB (because the representation happens to be invalid for bool). That seems a bit 2-steps-forwards-one-step-backwards.
I just assume any pod typed values are never initialized.
That's a decent defensive stance, but the issue (as shown in the example from the post) is that anything might transitively contain pod types, you can't tell just by looking at Response response;.
I think there's a much easier way to think about what a simple T x; initialisation does: it calls the default constructor of T. How is this constructor defined?
This describes the same rules as the list in the article, but contrary to that list, it's nicely inductively defined and actually makes sense. The only difference between std::string[10] and int[10] is between std::string and int; the array has nothing to do with it.
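A rough way to see the inductive view in code is through the standard type traits. A trivial default constructor does no work, so an object default-initialized through one is left indeterminate. The `Pod` and `Mixed` structs are hypothetical, added only for this sketch.

```cpp
#include <string>
#include <type_traits>

struct Pod { int x; };                         // hypothetical
struct Mixed { int count; std::string name; }; // hypothetical

// int has a trivial default constructor: nothing is written.
static_assert(std::is_trivially_default_constructible_v<int>);
// An array just default-constructs each element, so it inherits that.
static_assert(std::is_trivially_default_constructible_v<int[10]>);

// std::string has a real default constructor...
static_assert(!std::is_trivially_default_constructible_v<std::string>);
// ...and so an array of them does real work too.
static_assert(!std::is_trivially_default_constructible_v<std::string[10]>);

// A struct default-constructs each member in turn.
static_assert(std::is_trivially_default_constructible_v<Pod>);

// Caveat: Mixed is not trivial because of `name`, but `count` is still
// left indeterminate. Non-trivial does not mean "everything initialized".
static_assert(!std::is_trivially_default_constructible_v<Mixed>);
```

The trait only says whether the constructor does any work at all, so it is a proxy for the rule rather than a check that every member ends up initialized.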
Now whether this is a good idea, or whether it makes sense that computing with the value of an uninitialised integer is undefined behaviour but not for char, is a separate question entirely, and there I mostly agree with the rant in the OP. :P
Exactly. It’s not complicated; as far as this bug is concerned, you just have to know that primitive types like int have to be explicitly initialized. Even if they’re inside a struct. That's exactly like C.
Clang can generally detect when a bare variable (not in a struct) is used before initialization. I guess being inside a struct complicates the analysis too much? But there is a check in clang-tidy that warns about uninitialized fields.
Also, worth noting that in C++26 using an uninitialized variable is “upgraded” from UB to Erroneous Behavior which the compiler itself can flag as an error.
you just have to know
It's not a "just". You need to know definitions of specific types in each codebase, and that information is non-local and can change over time.
It's an awful user interface. It's like having a light switch in your house that either switches the lights on or sets the house on fire, depending on whether "don't set house on fire" switch in the basement is on. It's not complicated!
You're right that the fact that the rules are not so complicated doesn't mean that the system as a whole is a good idea, or pleasant to work in, or anything. However, I think it's still worth trying to present the rules as simply as possible: the simpler we present the rules, the easier it becomes to get intuition on the footguns, so that you can more effectively avoid them in practice.
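Regarding the char assert example: ASan (AddressSanitizer) indeed will not catch this, because no addressing rules are violated. UBSan (UndefinedBehaviorSanitizer) does not track uninitialized reads in general either; the tool aimed at that is MSan (MemorySanitizer), which should report it once the uninitialized value is actually used, for example in a branch. Fun stuff.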