How memory safety CVEs differ between Rust and C/C++
11 points by nrposner
So this might be rather naive, but considering how many issues in C/C++ seem to be about undefined behaviour: why don't they just define it?
In this revision round the C committee is reducing UB in the language: see the “slaying earthly demons” documents at https://open-std.org/jtc1/sc22/wg14/www/wg14_document_log.htm
As far as I know they mostly haven’t started tackling the library yet, though functions that take a size argument have been changed to behave sensibly with null pointers (because that was related to a language change that permits adding 0 to null pointers). There are plenty of other functions that could be similarly fixed, but changes to getenv() should perhaps be coordinated with POSIX.
I'd say there are at least three different reasons for something being UB in the standard:
Weird historical stuff nobody cares about anymore, you can "just define it". As @fanf notes, they're working on this. One example might be having a source code file with an unterminated string literal (yes really, this is UB in C).
Stuff that you could "just define", but it might cost you performance. An example would be signed integer overflow. If you defined that to simply wrap around, it would no longer be UB, but the compiler would not be able to do optimizations based on the assumption that it never happens. This is unlikely to be fixed because the committee is stacked with compiler people and compiler people tend to be benchmark fiends. That being said, sometimes there is movement here. For example, P2723 proposes for C++ that all local variables that would otherwise be uninitialized are implicitly zero-initialized.
Things that you cannot reasonably define behavior for. A good example is use-after-free. Without essentially mandating that everyone adopts a heavyweight runtime capability system such as Fil-C (which is not happening) or adding pervasive Rust-like lifetime annotations to the language (super duper not happening, formally shot down by WG21), how can you limit the set of behaviors exhibited by a use-after-free? Sure, you could go ahead and specify "on use-after-free, you touch whatever memory now happens to be there, or you get a segfault/abort". However, that doesn't help anybody. Use-after-free is still just as dangerous, and you still get the same number of CVEs. You cannot actually make any useful statements about what your program can or can't do beyond a use-after-free, so this is just UB by any other name.
Sadly, the third category is by far the most impactful, and so while it's good that some things are being "just defined already", it's not moving the needle too much.
For example, [P2723](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2723r1.html) proposes for C++ that all local variables that would otherwise be uninitialized are implicitly zero-initialized.
Something similar was adopted for C++26 via P2795. From [basic.indet]:
When storage for an object with automatic or dynamic storage duration is obtained, the bytes comprising the storage for the object have the following initial value:
--- If the object has dynamic storage duration, or is the object associated with a variable or function parameter whose first declaration is marked with the [[indeterminate]] attribute ([dcl.attr.indet]), the bytes have indeterminate values;
--- otherwise, the bytes have erroneous values, where each value is determined by the implementation independently of the state of the program.
Some of the undefined behaviors have been defined over time, but many of them must exist for optimization reasons. A well-known example is that for (int ii = 0; ii < something; ii++) relies on signed integer overflow being undefined so it can ignore the possibility of something == INT_MAX, which enables various loop transformations.
In Rust the equivalent functionality is divided into safe functions (which might be slightly slower) and unsafe functions (which permit UB if misused). See i32::wrapping_add() vs i32::unchecked_add().
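To make that split concrete, here is a minimal sketch. The helper names (add_wrapping, add_checked, add_unchecked) are made up for illustration; only the i32 methods they call come from the standard library, and unchecked_add assumes a reasonably recent stable toolchain.

```rust
/// Always defined: on overflow the result wraps around modulo 2^32.
fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Always defined: overflow is reported as None instead of wrapping.
fn add_checked(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

/// No overflow check at all.
///
/// # Safety
/// The caller must guarantee that `a + b` fits in an i32.
/// If it does not, the behaviour is undefined, much like signed overflow in C.
unsafe fn add_unchecked(a: i32, b: i32) -> i32 {
    unsafe { a.unchecked_add(b) }
}

fn main() {
    // Safe variants: overflow has a specified result.
    println!("{}", add_wrapping(i32::MAX, 1)); // wraps to i32::MIN
    println!("{:?}", add_checked(i32::MAX, 1)); // None

    // Unsafe variant: only sound because 40 + 2 cannot overflow.
    let sum = unsafe { add_unchecked(40, 2) };
    println!("{sum}");
}
```

The unchecked version hands the optimizer the same "this never overflows" assumption that C gives it everywhere, but the caller has to opt in with an unsafe block at each call site.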
If you took C and added a notation for marking some functions as unsafe and for enabling use of unsafe functions in a region then you could start defining safe variants, but at some point the level of effort to change C (more importantly, change the minds of the people who control C) doesn't make sense and it's easier to find a language that's better aligned with your goals.
The most commonly repeated explanation is that leaving certain behaviors undefined enables optimizations that would otherwise not be permitted. However, I believe this is mostly cope. Almost all such performance gains are niche and minimal at best. If you have a function that calls rm -rf / and is never called, and you make a function pointer call with UB, the compiler is technically permitted to generate code that unconditionally calls that function that wipes your disk. Ultimately, it's just bad spec design and legacy.
Undefined behaviour is often impossible to define. For example, if you write to item 11 in an array of 10 items, it entirely depends on what's there as to what happens next. Maybe it's a bool that tells the bomb to blow up and you just wrote a non zero value to it. Oops! You could prevent creating pointers to those locations, but then it would hardly be considered a systems/low-level language at that point.
Rust does exactly that and is still considered a systems/low-level language so that's not really a true statement. What is true is that you would be dramatically changing how the language type checks and compiles in ways that are not backwards compatible. And that's the real reason this doesn't happen.
The equivalent in Rust is also UB. It's just that the UB is only possible in unsafe.
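A rough sketch of the "item 11 in an array of 10" case, for anyone following along. The function names write_checked and write_unchecked are hypothetical; get_mut and get_unchecked_mut are the standard slice methods.

```rust
/// Safe write: refuses to touch anything outside the slice.
/// Returns true if the write happened, false if the index was out of bounds.
fn write_checked(items: &mut [u8], index: usize, value: u8) -> bool {
    match items.get_mut(index) {
        Some(slot) => {
            *slot = value;
            true
        }
        None => false,
    }
}

/// Unchecked write.
///
/// # Safety
/// The caller must guarantee `index < items.len()`.
/// An out-of-bounds index here is undefined behaviour, same as in C.
unsafe fn write_unchecked(items: &mut [u8], index: usize, value: u8) {
    unsafe {
        *items.get_unchecked_mut(index) = value;
    }
}

fn main() {
    let mut items = [0u8; 10];

    // Index 9 is the last valid slot; index 10 would be "item 11".
    assert!(write_checked(&mut items, 9, 1));
    assert!(!write_checked(&mut items, 10, 1));

    // Sound only because 0 is in bounds; passing 10 here would be UB.
    unsafe { write_unchecked(&mut items, 0, 7) };
    println!("{:?}", items);
}
```

Plain indexing (items[10] = 1) would panic at runtime instead of returning false, which is also defined behaviour. The out-of-bounds write itself is never given a meaning; it is only confined to the unsafe path.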
Yes, that's sort of the point I guess. You are limiting the potential sources in useful ways. Nearly every language gives you some version of unsafe though. Sometimes it's via the FFI interface, sometimes it's a keyword. The presence of UB doesn't determine whether you are a systems/low-level language, which is the point I was trying to make somewhat poorly.
Acknowledging that what is and is not a systems language isn’t formally defined, it would seem to me that the presence of UB in the language is at least well-correlated with it. The languages which avoid having any kind of UB (or at least, where it’s considered a fixable bug) are those which enforce certain guarantees with garbage collection and other high-level constructs which would be incompatible with e.g. writing an operating system kernel. Rust’s unsafe is required for really low-level memory manipulation that we can’t do away with at those tolerances.
Is it the case that most languages allow some manner of UB? Arguably via FFI with C/C++, but I’m not sure that’s a statement about the first language.
Well, the root of this thread is "well why don't they just define it?" Saying "Rust defines it" isn't really accurate. Rust makes it impossible in most cases, and that matters, but the equivalent is still not defined, and can't be for the same reason C can't.
I thought dereferencing a null pointer is well-defined behaviour?
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf page 4 (pdf page 18)
3 Terms, definitions, and symbols
3.5.3 undefined behavior
EXAMPLE An example of undefined behavior is the behavior on dereferencing a null pointer.
As far as your CPU ISA is concerned yes, but you don't program against that, you program against the C Abstract Machine which says it's UB.