Everything in C is undefined behavior

28 points by gerikson

davmac

This is ridiculous. It's literally saying that:

int foo(const int* p) {
   return *p;
}

... "has" UB because you can pass it a pointer that's not properly aligned, and that's UB.

That's ignoring that you already have UB before the function is called, in that case, and also trivial things like that you can pass a null pointer to this function and it would be UB in that case too, but nobody would normally claim that this means the function "has" UB.

Yeah C has a lot of undefined behaviour, but this post is just over-exaggerating waffle.
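
To make the "UB happens before the call" point concrete, here is a minimal sketch. It assumes the caller starts from a byte buffer; `load_int` is a hypothetical helper, not something from the article. Converting `buf + 1` to `const int *` is already undefined if the result is misaligned, so `foo` itself is never the culprit. Copying the bytes with `memcpy` avoids forming such a pointer at all.

```c
#include <string.h>

/* The function from the comment above: well-defined for any valid,
   properly aligned pointer to an int. */
int foo(const int *p) {
    return *p;
}

/* Hypothetical helper: read an int stored at an arbitrary byte offset.
   memcpy has no alignment requirement, so no misaligned int* is formed. */
int load_int(const unsigned char *bytes) {
    int value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

A caller holding `unsigned char buf[]` would write `load_int(buf + 1)` rather than `foo((const int *)(buf + 1))`.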

majaha

Did you read the article? They talk about that under the next heading.
- davmac
  
  They talk about that under the next heading
  
  Sure, they say something that's total rubbish and follow it up with a partial correction of that (which is framed not as a correction but an addition, "it's even worse than that!").
  
  Did you read my comment? It's not only making the point about unaligned pointers already being UB.
spc476

The examples in the article are not good, as one would have to write pretty atrocious code to hit undefined behavior with pointers [1]. And if the author had read a bit further in the C23 standard, they would have seen that isxdigit() already requires handling negative numbers, as in C23 7.23.1.4, EOF is defined as a negative integer.

And I don't think LLMs are, or even will be, better at UB than we are. It's trained on existing C code. Caveat emptor and all that.

[1] As long as you avoid pointer arithmetic and casting pointers (if you have to cast a pointer, you're probably doing it wrong). If you are passing a pointer to an array, mark it as such in the signature, like int foo(foo array[],size_t n), which at least signals intent (even if the underlying object is still a pointer).
- lemon
  
  they would have seen that isxdigit() already requires handling negative numbers
  
  That's true, but only EOF as a negative number.
  
  The standard is very direct about the ctype functions (C23 §7.4.1):
  
  In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
  
  On systems where char is signed, passing a char that is negative and not equal to EOF is UB.
  
  For example, glibc uses a lookup table, but explicitly to avoid breaking code that makes that mistake, they support negative char values too, so a range of -128..255: https://sourceware.org/git/?p=glibc.git;a=blob;f=ctype/ctype.h;hb=66f3e9219d8f86b977d9be04ad469b5d72af0da2#l71 (But it will crash with an integer outside that range, for example an int with the value of a non-ASCII Unicode codepoint.)
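
A small sketch of the usual fix for the signed-char case described above: convert to unsigned char before calling any ctype function. `count_hex_digits` is a hypothetical helper written for illustration, not code from the article or from glibc.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the hex digits in a NUL-terminated string. */
size_t count_hex_digits(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* The cast maps a negative char (a byte >= 0x80 where char is
           signed) into 0..UCHAR_MAX, which is what isxdigit() requires. */
        if (isxdigit((unsigned char)*s)) {
            ++n;
        }
    }
    return n;
}
```

Without the cast, a string containing UTF-8 bytes would pass negative values on platforms where char is signed, which is exactly the case the quoted paragraph makes undefined.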
nposting

I don't really like this style of dunking on languages with contrived examples and then justifying them after. (Some of) these issues can be a real problem, but if they are you can usually find a patch in real software. Like, "It's not rare for the denominator to come from untrusted input"...what?

"It's unsafe to write in $LANGUAGE without an LLM supervising you" is exceptionally condescending, especially when the evidence is "I found a bunch of UB in OpenBSD! No I won't show you."
- Aks
  
  "It's unsafe to write in $LANGUAGE without an LLM supervising you"
  
  After reading that, I felt like I was just reading another ad.
akavel

AFAIU, sentences like this:

On SPARC it would cause a SIGBUS.

already make the article misleading and wrong. It's a common misunderstanding of UB, which suggests that UB is merely implementation-defined behavior, or hardware-defined behavior. It is not. Rather, UB is "nasal demons" and "spooky action at a distance" territory: not at the level of hardware, but at the level of "random code generation or removal" in the compiler in a random place in your code. Your codebase in seemingly unrelated areas of code can change in completely unexpected ways as a result of having UB elsewhere in it - even if the UB-originating fragment of code is never executed at all!
- majaha
  
  even if the UB-originating fragment of code is never executed at all!
  
  This isn't really true. The execution of a piece of code can have Undefined Behaviour on a particular run of the program, but if the Undefined Behaviour never gets executed then the rest of the program will execute as defined. It can change the code that gets generated for it though.
  - muvlon
    
    Yes. The compiler being allowed to assume UB never happens can have weird nonlocal effects, but UB cannot "time travel" arbitrarily, especially not in ways that violate causality.
    
    Here's an easy way to see this:
    
    void foo(int* p) { if(p != NULL) { printf("%d\n", *p); } }
    
    If p is null, then *p is UB. Now if UB could time travel arbitrarily, it could go backwards and just make the null check misbehave, so the branch is executed, causing the UB in the first place. But of course it can't, null checks work.
    
    There was some language in older versions of the standard that could indeed be interpreted as "UB means all bets are off, every other part of the program can misbehave arbitrarily, even causing the UB to happen in the first place", but of course that's silly, because it would mean every program is always UB. IIRC, this has been fixed and now UB can only time travel backwards across code that is causally unrelated to the UB happening and doesn't contain side effects, or something of the sort.
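
A hedged sketch of the difference being described, using two hypothetical functions. In the first, the check guards the dereference, so a null argument never reaches `*p` and the check must work. In the second, the dereference comes first; on any execution that reaches the check, the compiler may assume `p` is non-null, so an optimizer is allowed to drop the check. That is the "nonlocal effect", and it only bites when the UB really does execute.

```c
#include <stddef.h>

/* Well-defined for every input: the dereference is guarded by the check. */
int read_or_default(const int *p, int fallback) {
    if (p == NULL) {
        return fallback;
    }
    return *p;
}

/* Broken ordering, shown for contrast only. */
int read_then_check(const int *p, int fallback) {
    int value = *p;      /* UB if p is NULL */
    if (p == NULL) {     /* an optimizer may remove this test */
        return fallback;
    }
    return value;
}
```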
    
    akavel
    
    Hmmm, ok, interesting, thanks. I trust you, and that would make some sense, and then it would mean there's at least some thin veil of sanity near UB. FWIW, this indeed does seem corroborated by one of the cool links I keep on UB:
    
    However, if your program avoids the code paths which trigger undefined behavior, then you are safe.
- lcamtuf
  
  In addition to what others have said (these are pretty contrived / odd examples), I'd point out that compilers have gotten vastly better at pointing out UB before acting on it. So for most practical intents and purposes, the situation is much better than ~15 years ago.
  
  I believe that many of the UBs in the standard are unnecessary and there's some research showing they don't really offer the claimed optimization benefits (https://web.ist.utl.pt/nuno.lopes/pubs/ub-pldi25.pdf), but in a semi-competently-written program where you're paying attention to compiler warnings, UB is a pretty unlikely source of functional or security bugs. Bounds checking and pointer lifetime issues, on the other hand...
  - lcapaldo
    
    UB is a pretty unlikely source of functional or security bugs. Bounds checking and pointer lifetime issues, on the other hand...
    
    Are not out of bounds accesses and use after frees instances of UB?
    
    lcamtuf
    
    Technically, sorta, but I think the distinction matters. The usual concern we talk about when describing something as UB is that you write seemingly sensible code and the compiler goes crazy and does something unexpected, such as removing the entire code block (e.g., "optimizing out" an overflow check on a signed integer). In contrast, the usual concern with OOB / UAF is that you write code that's clearly incorrect and the compiler does precisely what it's instructed to do, letting you shoot yourself in the foot.
    
    In the latter case, the outcome doesn't depend on the access being technically UB, because (a) almost by definition, OOB / UAF happens in situations where the compiler couldn't make that determination and didn't complain / optimize out the access; (b) even if the spec said this is defined behavior, you'd still have a security bug in the presence of direct memory access primitives.
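
For readers who have not seen the overflow-check case mentioned above, a minimal sketch with hypothetical function names. The first version tests for overflow after performing the addition, so the overflow (and the UB) has already happened by the time the comparison runs, and a compiler may treat the test as always false. The second version compares against the limits before adding, so no overflowing expression is ever evaluated.

```c
#include <limits.h>
#include <stdbool.h>

/* Post-hoc check, shown for contrast: a + b overflows before the
   comparison, so the compiler may assume the result is never < a. */
bool add_overflows_broken(int a, int b) {
    return b > 0 && a + b < a;
}

/* Pre-check against INT_MAX / INT_MIN; every subexpression is in range. */
bool add_overflows(int a, int b) {
    if (b > 0) return a > INT_MAX - b;
    if (b < 0) return a < INT_MIN - b;
    return false;
}
```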
    
    lcapaldo
    
    I would probably not have a quibble (or at least less of one) if you had said something like “Aside from oob/lifetime issues most UB is not a source of functional or security bugs.” It seems confusing if not misleading to say “UB doesn’t cause functional/security bugs, but [subset of UB] on the other hand.” I agree that you can distinguish between different triggers of UB, and the distinction you were making and the sense you were using UB would have been clearer if it had been called out.
    
    (You can still write “actually wrong” overflow checks even if the compiler is naive about it, eg, if overflow saturates, which leads to a real bug, no optimization required, but maybe this mistake wouldn’t fall into the “semi-competent” category?).
    
    lcamtuf
    
    “Aside from oob/lifetime issues most UB is not a source of functional or security bugs.”
    
    Again, my point is that it's a red herring. Make a C compiler that's identical to GCC except it doesn't have any "undefined behavior" optimizations in place; everything has defined semantics. In fact, you don't really need to do that, you get pretty close to that if you just disable optimizations in stock GCC (-O0). In that world, UAF / OOB is still exploitable.
    
    lcapaldo
    
    This strikes me as anachronistic, none of these UBs were added to enable optimizations. Simply removing the optimizations is not sufficient to defang them. Eg integer divide by zero is UB, and causes functional bugs, different functional bugs depending on the platform (eg trap on x86, eg results in 0 on ARM). Again no optimization is needed. It’s the same with my earlier example of signed integer overflow, no optimization is required to have a post operation overflow check that doesn’t work and whose branch is never taken.
    
    I don’t disagree that these wild optimizations are crazy to reason about, and that many things could be more narrowly defined, but many are still broken without the optimizations. Not optimizing away a null check because you already dereferenced the pointer means you’re still gonna have null pointer dereference. Again not defending the optimization, but UB is distinct from the optimization that assumes UB doesn’t happen.
    
    UB also means it is perfectly fine to have an implementation that bounds checks everything and prevents uaf. You don’t need something so strong as UB for that but it allows it. In that world it is not exploitable.
    
    You cannot remove undefined behavior by merely deleting its mention, you have to provide a definition otherwise you still have undefined behavior but now implicitly, the behavior is literally undefined if no description of it appears.
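
As an illustration of the divide-by-zero point, a small sketch of guarding the operation instead of relying on whatever the platform happens to do. `checked_div` is a hypothetical helper; it rejects both undefined cases for int division, a zero divisor and INT_MIN / -1 (whose quotient is not representable).

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores num / den in *out and returns true only
   when the division is defined; otherwise leaves *out untouched. */
bool checked_div(int num, int den, int *out) {
    if (den == 0) {
        return false;                 /* division by zero */
    }
    if (num == INT_MIN && den == -1) {
        return false;                 /* quotient does not fit in int */
    }
    *out = num / den;
    return true;
}
```

No optimization level is involved here: the guard is needed whether the unguarded division would trap or quietly produce a value.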
    
    chinmay
    
    "optimizing out" an overflow check on a signed integer). In contrast, the usual concern with OOB / UAF
    
    how is that different from optimizing out bounds checks because of OOB being UB?
    
    lcamtuf
    
    Because it's two different things with different root causes? If we wanted to be obtuse, then every problem in computer security (and beyond) can be reduced to a single mnemonic: "confused deputy". But that's not useful. It's good to have a taxonomy that distinguishes between specific causes and outcomes, because the fixes will be different.
    
    Again, the key point here is that OOB / UAF can still happen in C/C++ even if you edit the spec and remove all mentions of "undefined behavior". It doesn't depend on being formally declared UB. It depends on direct memory access + no compiler-enforced or hardware-enforced range checking / lifetime tracking.
    
    fanf
    
    I believe that many of the UBs in the standard are unnecessary
    
    Happily (remarkably!) the C committee agrees and is in the process of getting rid of gratuitous UB. Look at the first few pages of the current working draft, where they list the working papers whose changes have been incorporated. The “slay…” papers are all about reducing UB.
    
    stephc_int13
    
    This is a bad faith opinion piece disguised as a technical article.
    
    The whole discourse around UB "erasing your entire hard drive" is something I find quite irritating.
    
    I've been using C as my main programming language for 25+ years. I've never been annoyed even a single time by UB.
    
    Of course I almost never use the C std lib, I consider it part of the legacy and not something we should perpetuate as the default.
    
    Writing your own string and memory routines, as well as higher level constructs is not hard.
    
    UB is non-issue.
    
    spc476
    
    The only times I've been annoyed by UB in C (35+ years of experience) have been when working on pre-ANSI C code bases (and there are some still around, especially when working with retro technology). Lots of signed integers compared against unsigned integers, sequences like (real example): switch((*ptr++ << 8) + *ptr++) ... and the assumption that pointers and integers are interchangeable. Stuff that gives C a bad name.
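
The switch expression quoted above modifies ptr twice with no sequencing between the two increments, which is why it is undefined. A hedged sketch of a well-defined equivalent, assuming the intent was to read a big-endian 16-bit value; `read_be16` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: read a big-endian 16-bit value and advance the
   cursor by two bytes. Each read is a separate statement, so the order
   of evaluation is fixed. */
unsigned read_be16(const uint8_t **cursor) {
    const uint8_t *p = *cursor;
    unsigned hi = p[0];
    unsigned lo = p[1];
    *cursor = p + 2;
    return (hi << 8) | lo;
}
```

The original would then become `switch (read_be16(&ptr))`.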
    
    akavel
    
    Well, FWIW, I quit C++ and swore to never come back after a ~week-long session, down to gdb and disassembly, chasing a nondeterministic bug in a company's gnarly in-house multi-threaded C++ templates framework, written by in-house C++ standard contributors, which turned out to be one small issue between a race condition and UB. It was the first time I actually heard about UB after years with the languages, and it was an eye-opener. I try to bookmark some particularly valuable UB demos since then, though I've lost the bookmarks a few times, but my current collection is here: https://akavel.com/@cpp (including one specifically designed to erase your disk for funsies). Sure, I'm talking here about C++ in particular, which is not the same as C. But they're close enough that I feel burned by both since then.
    
    ayba
    
    UB is not magic. C specifies what works, and everything that is not on the path is a void rather than a wall, whether you like it or not. Personally, I learned to like having no safeguards, to force me to understand how the same code can work differently on various OSes and architectures. I see it as a game within the game.
    
    As for the standard library's obscure legacy weirdness, yes, it's quite bad.
    
    majaha
    
    There's a lot of nay-saying happening in this comment section which I think is quite unfair. I thought this was a good collection of C++ pitfalls, many of which I wasn't aware of and look very easy to fall afoul of. The bonus non-UB example is also surprising (though not well explained in the article).
    
    spc476
    
    It's still not a good collection. I mean, per this article, this bit of code:
    
    int foo(int a,int b) { return a + b; }
    
    is automatically UB because the addition could overflow even if you never call it with parameters that could overflow. And as I already mentioned, isxdigit() has to deal with negative values as EOF is defined as a negative number (author didn't read far enough down in the C23 standard). And while this is, per C rules, undefined behavior (because the all-0 bit pattern isn't guaranteed to be a null pointer):
    
    int *p; memset(&p,0,sizeof(p));
    
    it is defined by POSIX (or is it still undefined behavior on a POSIX system?)
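
Whatever a given platform guarantees about all-bits-zero, the portable way to get a null pointer is to assign NULL or use an initializer, since ISO C does not promise that a zeroed representation is a null pointer. A minimal sketch, with a hypothetical `struct node` for illustration:

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Portable initialization: assigning NULL yields a null pointer no
   matter what bit pattern the platform uses for it. */
void node_init(struct node *n, int value) {
    n->value = value;
    n->next = NULL;
}
```

An initializer such as `struct node n = {0};` has the same effect, because the pointer member is initialized as a null pointer rather than as raw zero bytes.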
    
    There are plenty of things you can criticize C for without the fear mongering. For instance, yes, the integer promotion rules could be better. The Standard C library has issues (lots of hidden state for instance, and some functions like atoi() are just badly designed). You have to do manually (like memory tracking, or object orientation) what other languages can do automatically. Bashing C with UB is an easy target. One can criticize better.
    
    Edit: for clarity around my opening statements.
    
    gerikson
    
    Meta-note: article discusses (among other things) using LLMs to detect undefined behavior in existing code. Tag "vibecoding" added to protect the innocent.
    
    intelfx
    
    Meta-note: article discusses <...> using LLMs <...> Tag "vibecoding" added
    
    This is getting ridiculous.
    
    What next, if an LLM flew within a radius of 100 meters from the article, it also gets slapped with a witch hunter's mark?
    
    gerikson
    
    I predicted that if I didn't add the tag, someone else would. Your mileage may vary. If you don't filter on the tag, it's not relevant.
    
    Edit: I only added the comment because of recent discussions around the inclusion of LLM-generated (and by extension, LLM-adjacent) content. I'm trying to be a better submitter going forward.
    
    jmmv
    
    What next, if an LLM flew within a radius of 100 meters from the article, it also gets slapped with a witch hunter's mark?
    
    And that's probably already true today, because if you dared to use Google to do any sort of research to write an article, you'd have been exposed to an LLM and some of it might have leaked into your text. OH NO!
    
    Yes, the abuse of the vibecoding tag is ridiculous.
    
    gerikson
    
    My guideline is: would someone filtering the tag be interested in reading the submission?
    
    Anyway, I've learned my lesson. Next time I'll just add the tag and not mention it. Much more pleasant experience.
    
    technetium
    
    Please consider that many who filter the tag may still be interested in this story—not everyone is that puritan. It goes both ways.
    
    jmmv
    
    Exactly. The whole premise of the article was NOT about LLMs. It just finished with some chat about LLMs that I was happy to skip because "it's old news" -- but everything leading to it was interesting on its own.
    
    alurm
    
    Didn't consider how NULL interacts with variadic functions. Thanks.
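
For anyone else who missed it, a hedged sketch of what the NULL-and-variadics pitfall presumably refers to: NULL may be defined as a plain 0, and an argument passed through `...` gets no conversion to a pointer type, so reading it back with va_arg as a pointer is undefined. Casting the sentinel avoids that. `total_length` is a hypothetical function written for illustration.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sum the lengths of a list of strings that is
   terminated by a null pointer of type const char *. */
size_t total_length(const char *first, ...) {
    size_t total = 0;
    va_list ap;
    va_start(ap, first);
    for (const char *s = first; s != NULL; s = va_arg(ap, const char *)) {
        total += strlen(s);
    }
    va_end(ap);
    return total;
}
```

Call it as `total_length("ab", "cde", (const char *)NULL)`; the cast on the sentinel is the important part, since a bare NULL or 0 in the variadic position may be passed as an int.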