Obvious Things C Should Do
50 points by susam
_Static_assert plus compile-time function evaluation does not a unit test make
I read that as ‘no separate build target’, ‘no separate binary’, ‘unit tests run at compile time’. So static_assert and compile-time evaluation together facilitate unit tests in a straightforward and reasonable way, one that isn’t available today and would otherwise take a lot more work to build out and maintain. Seems like an easy win for the standard, and if dlang’s ImportC is any indication, a healthy addition to the language.
I assumed that they meant something like Rust’s built-in test stuff (and didn’t understand why that would be necessary); it didn’t occur to me that they considered a static assert to be a compile-time unit test. You could certainly do actual consteval unit tests in C++. I wonder how many people on your team would scream at you if you did so :)
I did this for our message queue after getting the overflow behaviour wrong a few times. Now, there’s a compile-time check that exhaustively checks all possible states and transitions for a bunch of small-sized queues. It adds a few ms to compile time, but no one has complained yet.
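For readers who haven’t seen this style, here is a minimal sketch of the flavour in standard C11, assuming a hypothetical ring buffer with QUEUE_CAP slots and a Q_NEXT wrap macro (not the actual code from that message queue):

#include <assert.h>   /* static_assert is a macro here in C11; a keyword in C23 */

#define QUEUE_CAP 8u

/* Hypothetical wrap-around index arithmetic used by the queue. */
#define Q_NEXT(i) (((i) + 1u) % QUEUE_CAP)

/* Compile-time "unit tests": checked on every build, no test binary needed. */
static_assert((QUEUE_CAP & (QUEUE_CAP - 1u)) == 0u, "capacity must be a power of two");
static_assert(Q_NEXT(QUEUE_CAP - 1u) == 0u, "index must wrap back to zero");
static_assert(Q_NEXT(0u) == 1u, "index must advance by one");

With compile-time function evaluation in the language, the same idea would extend from integer constant expressions to ordinary functions, which is the appeal.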
I was compiling Haskell so without direct control of the stack it wasn't possible to avoid :D
Who cares when we could, instead:
Remove the UBs.
Check out the “slay some earthly demons” changes listed in the abstract of the current committee draft and in the WG14 document log – though they seem to have dried up in recent months, surely there are plenty more that are easily dispatched!
Having to add forward declarations and header files is maddening indeed. One of the reasons I gave up on writing C for my own projects was those two. Just foolish busywork.
Sure, there can be cases where one may want a modified header file, but in my projects that was really never the case.
Of the four items mentioned, forward declarations is the one that annoys me the most. I wish C could handle that, but doing so means the compiler would have to go through the source twice, which may impact compile times (although everybody complains about compile times anyway so ... )
I am in two minds about this, because I do like the clear separation of interface and implementation. I can read a header file and see just the interface to a library, no implementation. This is less true of C++, which requires exposing some bits of implementation in headers.
Languages like Java and Rust have standard tooling to parse the implementation and generate the interface docs, but someone needs to run them. If you’re just browsing a source repo online, you need to find where someone has put these docs. With C, you just read the .h file.
The .h files could have been autogenerated, if the C preprocessor edge cases weren't such a PITA.
In an alternative universe, .h files could have been like .o and .d: easy-to-rebuild derived disk litter, not precious artifacts.
That’s more or less what C++ modules do. It makes the build system a lot more complicated. You need to parse every source file to work out which modules it exports parts of, and which modules it needs.
Java simplifies this problem by requiring namespaces to map directly to directories and every public class to be in a file with the same name. This means that you can find any included class by a trivial text substitution and filesystem lookup.
Rust requires crates to be in directories and is a bit more clever.
But C often has one .h file for a load of APIs implemented in a bunch of .c files. Part of this came from linker behaviour. Early linkers could remove unused .o files but nothing finer grained. So, for example, each entry point in stdio.h would be in a separate file and would then invoke internal things. If you didn’t call any of the printf-like APIs, none of the entry points would be pulled in, so neither would the internal implementation.
Modern languages can do that kind of DCE easily, but C had very tight memory limits. The linker needed to be able to iteratively walk symbol tables and pull in things that are used in a single pass. Multi-pass linkers wouldn’t fit in memory.
And, unless you committed your generated .h files to your revision control system, this still doesn’t come with the same benefits.
Some of these are just historical curiosities from the '70s/'80s used to excuse warts in the language. It isn't less of a wart even if it was a critical feature 40 years ago.
I know C couldn't count on even having the full source code in memory. IIRC the earliest implementations didn't have a linker and expected the user to define a base address for every compiled file.
But it's so frustrating that C can't seem to untangle itself from a past that doesn't exist any more. Every PDP-11 workaround is now a Chesterton's fence or a deeply ingrained backwards compatibility conundrum. It took so many decades to just drop trigraphs.
It’s a wart, but it’s also a constraint on how the language can evolve.
A language that has enforced a code structure that provides a mapping from things in APIs to source files can do a lot of what you want quite easily. Doing it in a language which has 50 years of legacy code that has an N:M mapping between public APIs and source files is hard.
C++ tried this with modules. They have been in the standard for years but implementations struggle and I’ve not seen any use of them in projects except as an optional thing as an alternative to precompiled headers. And C++ modules were the result of a load of implementation attempts including Apple shipping Objective-C modules in production for a few years.
Parsing is essentially never the bottleneck these days. If you compile with -O0 and your files are all in cache, maybe you notice a difference, but compiling C that way is blazing fast anyway, so in reality I doubt you would.
Is there a specific reason people don't want a two-pass compiler for C? 99% of the time, the machine that's compiling the code is a lot more powerful than the machine that the code will run on.
I would like to see a real revolution in C between versions, and those who prefer C99 can stick to it. So many compilers have extensions to deal with C's problems, so it's clear that change is needed.
It would break code if you write for a two-pass C compiler and try to use a one-pass C compiler.
At this point, I think any improvements to C need a solid break from C and be called something else. This language could still compile existing C code, but not the other way around. Some things I would like to see: a whole lot less undefined behavior, and arrays passed with bounds information automatically included (I think this is the original sin in C personally).
It would break code if you write for a two-pass C compiler and try to use a one-pass C compiler.
That isn’t a problem. C99 code will break with a C89 compiler. C23 code will not compile with a C11 compiler. C requires compatibility in the other direction only (you should be able to compile C89 with a C23 compiler). If C26 required a two-pass compiler, that would be fine. And most C compilers are C dialects in a C/C++ compiler, so they already have this machinery.
I just want them to add several functions to the standard library for hash tables and maybe resizable arrays. Not even proper language support, just plain functions in the libc. Such a small addition for such common functionality.
It'd have no effect on existing codebases. No problems with backwards compatibility whatsoever. The only argument against it is that it might add a few kilobytes to an embedded libc.
And yet here we are in 2023...
Hash tables and lists in C are so simple to roll yourself in most cases that it hardly seems like a benefit. Why commit such a small addition to the standard library when it's practically guaranteed that 99% of users will complain about it and end up rolling their own or pulling in a dependency anyway?
Hash tables and lists in C are so simple to roll yourself in most cases that it hardly seems like a benefit. Why commit such a small addition to the standard library when it's practically guaranteed that 99% of users will complain about it and end up rolling their own or pulling in a dependency anyway?
Here's an argument for adding them. Correct and reasonably efficient hash tables and some form of dynamic lists are not simple for beginners to roll themselves. If they were in the standard library, then C would become far, far more approachable for new (and relatively new) programmers.
Maybe new and relatively new programmers should not use C. I don't have strong feelings about that one way or the other. But as you and dmpk2k seem to agree, the additions wouldn't add significant size to the standard library, and experts can simply ignore them and continue to write their own or use a third-party version that they prefer.
To put this another way, people can complain all they like (though I doubt that 99% of C coders would care that much), but nobody is harmed by such a small addition and some people definitely benefit.
The problem with a standard library hash table in C is that you have to pick one of two approaches:
Option 1, a type-erased hash table where the keys and values are void*, and hash and compare functions are callbacks. This adds some pointer-chasing overhead (you can’t store objects inline; each entry must be heap allocated). It’s also not even slightly type safe, and because C lets you implicitly cast from void*, the lack of type safety makes it easy to introduce bugs.
Option 2, a type-safe hash table implemented as macros. This comes with debugging difficulty (if you think C++ template error messages are bad, wait until you see the compiler errors for C macro equivalents). These can do inline storage but now you need to be really careful about object lifetimes during iteration and that may expose implementation details. You either do what C++ did for std::map and enforce a specific data structure (ideally a better one) or you have code where some mutations interleaved with iteration work on some platforms but not others.
Most kernels go with option 2 and have a bunch of 4BSD-derived data structures in macros. Debugging these is a horrific experience.
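To make the macro style concrete, here is roughly what using the BSD sys/queue.h singly-linked-list macros looks like (these ship with the BSDs and glibc; details vary slightly between implementations):

#include <sys/queue.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    SLIST_ENTRY(node) link;   /* linkage embedded in the element itself */
};

SLIST_HEAD(node_list, node);  /* declares the list-head type */

int main(void) {
    struct node_list head = SLIST_HEAD_INITIALIZER(head);

    for (int i = 0; i < 3; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        SLIST_INSERT_HEAD(&head, n, link);
    }

    struct node *n;
    SLIST_FOREACH(n, &head, link)     /* iteration is also a macro */
        printf("%d\n", n->value);
    return 0;
}

The type safety comes from the macros expanding to code that names struct node directly; the price is that any mistake surfaces as an error inside the expanded macro, and stepping through that in a debugger is no fun at all.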
Option 1.5 is to do what database storage engines tend to do by working with raw byte-oriented data. For that you need to know the key byte slice (e.g. specified by an offset range) so you can do memeq and memhash. It also extends to ordered data in B-trees where you'd support both little-endian and big-endian orderings with fast paths for at least the 1/2/4/8-byte case. Another dimension is statically sized (per collection) vs dynamically sized (per item) data and direct vs indirect storage.
This tends to be my personal preference for general-purpose data structure libraries in C. While all of these options come with trade-offs I've found that when this byte-oriented type-erased approach stops being adequate for my purposes, it's usually time to look at a more specialized data structure design and implementation rather than accepting the trade-offs of any general-purpose library.
These tend to work for "simple" keys but have a tendency to break randomly on more complex keys when using -O2 and -O3 due to padding. You need to either use the packed attribute or name the padding bytes and make sure they are initialised to zero.
They also don’t work if the key is a more complex data structure. Variable-length strings, either as C strings or structures containing pointer to byte data, for example. Strings are common keys, but for the flattened structure you need to intern the strings first and that means you need a hash table to make your hash table work.
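A small illustration of the padding pitfall mentioned above (the layout assumed here is what most ABIs produce, but it is not guaranteed by the standard):

#include <stdio.h>
#include <string.h>

struct key {
    char tag;   /* 1 byte, typically followed by 3 padding bytes */
    int  id;    /* 4 bytes */
};

int main(void) {
    struct key a, b;

    /* The padding bytes hold whatever was on the stack, so two keys built
       field-by-field can memcmp unequal and hash to different buckets. */
    a.tag = 'x'; a.id = 42;
    b.tag = 'x'; b.id = 42;
    printf("naive:  %d\n", memcmp(&a, &b, sizeof a));   /* may be nonzero */

    /* Zeroing the whole object first makes the byte image deterministic. */
    memset(&a, 0, sizeof a); a.tag = 'x'; a.id = 42;
    memset(&b, 0, sizeof b); b.tag = 'x'; b.id = 42;
    printf("zeroed: %d\n", memcmp(&a, &b, sizeof a));   /* 0 */
    return 0;
}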
The problem with a standard library hash table in C is that you have to pick one of two approaches
Two serious questions. Assume I know nothing about this. (I know next to nothing, and I'm genuinely curious. By coincidence, I've been trying to learn C properly after several years of amateur programming with higher-level languages.)
C doesn't have traits/typeclasses or anything like generics, so the options are limited.
The simplest hash tables are monomorphic, meaning they're designed for a specific pair of types, e.g. mapping ints to some struct. Maybe the int is a unique ID for each entry, and you know some properties of that ID (like that they start counting up from zero, or that they will never exceed INT_MAX). That lets you pick a suitable hash function (will there be collisions? How do you handle those?).
But all of this depends on your specific use-case; it's not an option for a library aiming for general use.
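As a concrete (and deliberately naive) sketch of the monomorphic case, here is a fixed-size int-to-int table with linear probing; all names are made up for illustration, and it omits deletion, growth, and handling of a full table:

#include <stdbool.h>
#include <stdio.h>

#define CAP 64u   /* power of two; must comfortably exceed the item count */

struct table {
    int  keys[CAP];
    int  vals[CAP];
    bool used[CAP];
};

static unsigned hash(int key) {
    return (unsigned)key * 2654435761u;   /* cheap multiplicative mix */
}

static void put(struct table *t, int key, int val) {
    unsigned i = hash(key) & (CAP - 1u);
    while (t->used[i] && t->keys[i] != key)
        i = (i + 1u) & (CAP - 1u);        /* linear probing on collision */
    t->used[i] = true;
    t->keys[i] = key;
    t->vals[i] = val;
}

static bool get(const struct table *t, int key, int *out) {
    unsigned i = hash(key) & (CAP - 1u);
    while (t->used[i]) {
        if (t->keys[i] == key) { *out = t->vals[i]; return true; }
        i = (i + 1u) & (CAP - 1u);
    }
    return false;
}

int main(void) {
    static struct table t;   /* zero-initialised: all slots unused */
    put(&t, 7, 49);
    int v;
    if (get(&t, 7, &v))
        printf("%d\n", v);   /* prints 49 */
    return 0;
}

Every decision here (capacity, hash, probing, no deletion) bakes in assumptions about one specific use-case, which is exactly what a general-purpose library cannot do.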
That's why in the POSIX hash table (search.h), you see void * everywhere. In C, void * simply erases type information, letting you pass a pointer to data of any type. Now it's polymorphic¹, but it's up to the programmer to make sure they only ever call these functions with pointers of the correct types, everywhere.
And sure, you could write wrapper functions that do nothing but cast your data to the correct type and only use those, but the functions in search.h will always be around as a footgun for the other members of your team. Besides, if you're writing wrappers with names like initialise, insert, lookup... you're half-way to implementing your own hash-table anyway.
This is likely why the culture of C programming is to write your own data structures to fit your exact needs.
1: Notably, the search.h hash-table is only half-polymorphic, like the Python dict: the items can have whatever type you want, but the keys are always char * (C) or str (Python). Some languages (e.g. Rust, Haskell) have hash-tables in their standard libraries that are polymorphic on both keys and items.
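For concreteness, this is roughly how the search.h interface gets used (POSIX, not ISO C; one process-global table, keys as char *, data as void *):

#include <search.h>
#include <stdio.h>

int main(void) {
    int forty_two = 42;

    hcreate(16);                          /* the one and only table */

    ENTRY item = { .key = "answer", .data = &forty_two };
    hsearch(item, ENTER);                 /* insert */

    ENTRY query = { .key = "answer" };
    ENTRY *found = hsearch(query, FIND);  /* lookup */
    if (found)
        printf("%d\n", *(int *)found->data);   /* cast back from void * */

    hdestroy();
    return 0;
}

Note that hcreate() gives you exactly one table per process, which is what the complaints further down this thread are about.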
the items can have whatever type you want, but the keys are always char * (C) or str (Python)
You're thinking of JS objects, not JS's Map. Python dict keys are anything with __hash__ and __eq__ methods.
Thanks. That's all very helpful. If I follow, then I suppose that C programmers are likely to learn general rules for writing hash tables and then create ones tailored for specific keys and values each time they need one. Your last comment seems to bear that out.
This is likely why the culture of C programming is to write your own data structures to fit your exact needs.
Maybe that's why C programmers tend to say things like "rolling your own hash table is simple." They're so used to doing it that they can't see the difficulties any longer. (Also possible: it is simple, and I'm just dense.)
Maybe that's why C programmers tend to say things like "rolling your own hash table is simple." They're so used to doing it that they can't see the difficulties any longer.
Maybe! Another consideration: C in many ways is a language that forces the programmer to care about minor details like memory layout -- the trade-off being that they get fine control over these things. (I know "you can" do these things in Haskell or whatever, but it's not going to be idiomatic code.) If you're choosing C then you probably want that control, and therefore it doesn't make sense to use stock data structures if they aren't exactly right for your needs.
That said, I don't think most people re-write hash-tables for every application. I know quite a few people who have their own directories of C data structures that they copy-paste from with minor tweaks (types, hash- and comparison-functions, etc.).
For a humourous take on C programming culture, read The Night Watch by James Mickens!
Maybe new and relatively new programmers should not use C.
That's what I wanted to say, until you did. Though what I really think here is that people should learn to code things from scratch in C before using it for real. See, with its lack of generics, C really really encourages custom code for pretty much everything. Stuff that a simple type based code generator would not necessarily capture.
And even though C's syntax is simple enough that it is possible (though not trivial) to write a full-featured template system through a pre-processor with actually nice error messages, it is not necessarily all that useful given the above.
Then there's the problem of allocation. If you use malloc()/free() under the hood, you need to remember to destroy your collections before they go out of scope, things like that. Or you add destructors to your pre-processor, an additional step towards C++. A much simpler solution would be to use arena allocators and pass one to the standard collection. Thing is, the best arena allocator for the job also depends on the application and the target machine. On a 64-bit desktop I may want to reserve 64GiB of contiguous address space, but on an embedded machine without an MMU my options are more limited.
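For anyone who hasn't met the term, an arena in its simplest form is just a bump pointer over one block. A minimal sketch, with made-up names, no per-object freeing, and the whole arena reset at once:

#include <stddef.h>

struct arena {
    unsigned char *base;   /* caller-supplied buffer */
    size_t         cap;
    size_t         used;
};

static void *arena_alloc(struct arena *a, size_t n) {
    size_t align = _Alignof(max_align_t);
    size_t off = (a->used + align - 1) & ~(align - 1);   /* round up */
    if (off > a->cap || n > a->cap - off)
        return NULL;                                     /* out of space */
    a->used = off + n;
    return a->base + off;
}

int main(void) {
    static unsigned char buf[1 << 16];
    struct arena a = { buf, sizeof buf, 0 };

    /* A collection that takes a struct arena * instead of calling malloc()
       itself can be torn down in one step: a.used = 0. */
    int *xs = arena_alloc(&a, 100 * sizeof *xs);
    (void)xs;
    return 0;
}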
Whatever you put in the standard library will have to be some sort of lukewarm compromise. Okay for casual use, but no one working on a serious program will use it. I mean, I've worked in C++ for almost 20 years by now, and in the last couple of months I've got so sick of the STL that I've decided to avoid it going forward. The compile times alone make it unreasonable for any serious work.
(Putting this in a reply to myself because I can't edit my original question any longer—a feature of Lobsters that I hate, though maybe I don't understand all the reasoning behind it.)
For people like me who are interested in how to implement arrays or hashes in C, here are some resources I found.
The 99% is pulled out of thin air, but I think a couple of the other comments in this thread support me. A simple addition such as we're talking about here won't satisfy everybody, that's a given, but I'd also argue that it won't satisfy many or even most people. So, with that in mind, I suspect that all you'd be doing is catering to beginners, and that seems unnecessary. Let beginners cut their teeth on Rust, golang, or C++ for a friendlier, more approachable experience in a systems programming language.
While C is a simple language in terms of syntax and stdlib, I think you'd agree that doesn't make it a simple language to use in practice. I don't believe that adding a simple, beginner-friendly, beginner-focused hash table or dynamic array to the stdlib addresses the actual difficulties one encounters once they're situated with the language. I think the best way to address the difficulty of C in practice is to remove some of the footguns that are there today instead of, arguably, adding more. Make the language safer, for some definition of "safer", without introducing added complexity, and I think you'll have a more beginner-tolerant language.
I also disagree that "reasonably efficient" hash tables in particular are not simple for beginners to roll themselves. Obviously, "reasonably efficient" is open to interpretation, and the application is clearly a factor, but a simple chained hash table? If you're using C already, beginner or not, I don't think it's unreasonable to expect you to be able to bang that out without too much trouble. Anything beyond that is probably more specialized than anything you'll find in the stdlib today, and hence probably doesn't need to be there.
I don't think the original article addresses footguns specifically, but at a glance the items mentioned absolutely scratch an itch that I've had with C. And I think Walter's made other observations that are very much in line with improvements that everybody can benefit from, including beginners.
While it's not part of the Standard C Library, hcreate(), hsearch() and hdestroy() are mandated by POSIX. So there you go.
Don't ever use those, it's a terrible API. Like, once you want a second hashtable in your program, you have to use something else.
Just fork a helper process for each hashtable, easy.
j/k of course. I wonder how something so limited ended up in POSIX. Must be some interesting history behind it. Probably used by some early tooling. Linker maybe?
j/k of course.
They were most probably not joking. I'm guessing it's the same philosophy that guided the design of the original Lex and Yacc, which assumes any given program will have at most one parser. They would absolutely tell you to just fork and have one address space per parser.
For all the bad OOP did to the world, there is one good thing about it: it popularised the (older) idea that we can make several instances of stuff.
:D
I wrote code that handled needing more stack by catching stack overflow and launching a new thread to continue evaluation :D
From the looks of it, this managed a single global hash table, which strikes me as a really unpleasant API for nontrivial programs.
Thank you for that. I'm sure that's very much what @dmpk2k is looking for. I've never worked with POSIX (directly), and I've never heard of nor encountered those routines. My C/C++ life was exclusively Windows, and it's been a while. However, judging from @kryptiskt's and @quasi_qua_quasi's comments below, I suspect that my argument against adding similar routines to the stdlib still stands.
If you need a hashtable in C today, you're probably better off rolling your own. If you need something more sophisticated or robust, then use a library. The efforts of a standards body are better spent improving the language and the tooling.
It's subtle, but no standard C function allocates memory (nor does the standard say whether a given function must or must not allocate) except for malloc(), calloc() and realloc(). It's probably intentional by this point.
no standard C function allocates memory (or mentions if a function must/must not allocate memory) except for malloc(), calloc() and realloc()
aligned_alloc(), cnd_init(), thrd_create() (C11), strdup(), strndup() (C23)
I haven't looked much past C99 (that's about all I need for C), but I don't see where cnd_init() or thrd_create() allocate memory (maybe behind the scenes, quite possibly how fopen() might do that), but aligned_alloc(), sure. As for strdup() and strndup(), nice. It's about time.
Most well-established projects have these data structures in a library that the whole codebase is standardized on, and even if not, a copy of the implementation can easily be grabbed from, e.g., libbsd, so while it would be nice it's not that big of a problem IMO.
I started my career writing C, and then a bit of C++, but I haven’t touched either language in anger in at least 20 years.
The other day I was reorganising some code and I realised the main loop was in the wrong place - at the top. So I moved it to the bottom where it belonged.
And then I asked myself, why? Why does it feel wrong to have it at the top? Doesn’t it make more sense to read top down? I really couldn’t understand why it felt so wrong.
And today I read this article, and now I remember why. Ugh!
I'd rather see C solve problems particular to the niche where C is the best fit. I'd like a solution to the reverse of what varargs does, i.e. be able to dynamically construct the call stack for a function. There is libffi from gcc but you're reliant on that knowing the ABI for your compiler/architecture.
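For reference, this is roughly what using the libffi interface looks like (compile with -lffi; details vary a bit across versions), constructing a call frame for an ordinary C function at runtime:

#include <ffi.h>
#include <stdio.h>

static int add(int a, int b) { return a + b; }

int main(void) {
    ffi_cif cif;
    ffi_type *arg_types[2] = { &ffi_type_sint, &ffi_type_sint };

    /* Describe the call frame: default ABI, two int args, int return. */
    if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 2, &ffi_type_sint, arg_types) != FFI_OK)
        return 1;

    int a = 2, b = 3;
    void *arg_values[2] = { &a, &b };
    ffi_arg result;   /* wide enough for any integer return value */

    /* Build the call at runtime and invoke it. */
    ffi_call(&cif, FFI_FN(add), &result, arg_values);
    printf("%d\n", (int)result);   /* 5 */
    return 0;
}

And of course this only works because libffi hard-codes knowledge of each ABI; there's no portable way to express it in C itself, which is the gap being pointed out above.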
Forward declarations make sense when you consider the original requirement for a C compiler to be able to work in a single pass. It's never especially bothered me and I suspect that changing that would be hard to define in a way that doesn't break a lot of old macro trickery in old code.
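A minimal example of that constraint: in a single pass the compiler reaches the call before the definition, so the declaration has to be supplied by hand (hypothetical names, just to show the shape):

static int helper(int x);        /* forward declaration */

int compute(int x) {
    return helper(x) + 1;        /* used here, before its definition */
}

static int helper(int x) {       /* definition appears later in the file */
    return x * 2;
}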
This sounds like it’s saying that D has compile time evaluation features like zig, but in a way that’s not super clear. I’d rather just have an advert for D’s features with comparison.
We D users would probably say Zig has compile-time evaluation like D, since D has had it since at least 2009, and I'm pretty sure for quite some time before that too (I just happen to still have a copy of the compiler from back then to test with right now...).
But in D, you just write any ordinary function, and if you use it to initialize a value that must be known at compile time, it will interpret the function to get its result. Basic example:
int foo() {
    int b = 0;
    return b + 5;
}

static int a = foo();
Ordinary function, you can use local variables and everything. Then, when using it to initialize the static variable, something that needs to be done at compile time, it evaluates it. So, a's initial value will be the same as if you wrote a literal 5 there. If you tried to do something that isn't possible at compile time, it will error out:
import std.file;

int foo(bool makeItImpossible) {
    if (makeItImpossible)
        return (cast(int[]) std.file.read("some_file"))[0];
    else
        return 5;
}

static int a = foo(false); // it works here
static int b = foo(true);  // but this errors out
The message will look like:
file.d(369): Error: `open64` cannot be interpreted at compile time, because it has no available source code
ctfe_test.d(9): called from here: `foo(true)`
Notice how it doesn't error until it tries to interpret something impossible at compile time; it isn't based on annotations or anything like that. As long as the branch you actually call works, you'll be OK.
It's important to know that the compiler never opportunistically tries to run code at compile time - it only ever attempts it in a situation where the value must be known at compile time, and if the interpretation fails or the function throws an exception, it is a compile-time error.
CTFE can do a lot of things and return most objects.... but note it is kinda slow. I see a lot of D programmers use it just because they can rather than because they should, and this can hurt compile time a lot. But it also has some nice advantages, in that you can couple it with reflection and code generation to, for example, auto-generate html forms and their server side validation functions based on struct definitions and many other things like that.
This article talks about the ImportC feature, which parses C code and pastes it into a D abstract syntax tree, so you can use C syntax with most of the D features on the inside. To be honest, it is a mixed bag in practice, but that's why Walter wrote this article: he just realized that pasting the C objects into the existing D code enabled all this stuff and thought that was kinda cool.