Without the futex, it's futile

32 points by eatonphil


david_chisnall

If you’re going to talk about futex, you should also look at FreeBSD’s _umtx_op, which is a quite different point in the design space. There are two important and quite separable aspects to a futex.

The first is the ability to do the fast path without any scheduler interaction. This is really important for a lot of places where the uncontended path is common. There is some disagreement over whether this is actually important: some people argue that a mostly uncontended lock is a sign that you are using the wrong primitive (if it’s uncontended, a lock-free data structure might be a better option). I’m not really convinced by that argument. With something like a futex, you can do the fast path entirely with atomic operations and only do the system call if you are likely to need to sleep (you can’t guarantee that you will sleep, another thread may remove the condition that requires blocking on your way into the futex wait call).

The other aspect of the futex is to provide a generic primitive. The wait call is an atomic compare and wait. You can use it fairly easily to turn a spinlock into a sleepable mutex, but you can use it in a load of other places as well. For example, if you want a single-producer, single-consumer ring buffer, you can use a pair of free-running counters for the producer and consumer, with the free space identified by subtracting one from the other, If the queue is not full, the producer can make progress, if the queue is not empty, the consumer can make progress. If the queue is full, the producer can do a futex wait on the consumer counter to be notified that it can proceed. The consumer needs only to do a futex wake when the queue transitions from full to non-full, otherwise it’s just doing an atomic increment.

The downside of a futex is that some primitives are hard to implement in a single word. The _umtx_op semaphores and condition variables are much simpler to use than the futex equivalents, at the expense of generality. In Linux, futex has grown a load of special cases while retaining the property that it operates on a single 32-bit value. For example, it implements priority inheritance by having the kernel know that the low bits contain the thread ID of the thread to boost, which removes some generality. It also has some custom support do an exciting structure where a 32-bit word is 32 locks that can be acquired atomically as a set (which has some nice properties for avoiding lock-order reversals, but some terrible properties for caches).

I implemented futexes for CHERIoT RTOS because the scheduler is intended to be simple, and so moving all of the code for locks except an atomic compare and sleep primitive out is useful. Futexes are the only blocking primitive that we expose, we map interrupts to counters that are incremented when the interrupt fires, so you wait for hardware and software events in exactly the same way. We also have a multiwaiter API to wait for more than one futex at a time, so you can multiplex hardware and software events into the same blocking / polling call.

matklad

Allow me to hijack this for an ask: what is a good resource to learn about hardware memory models? I have a somewhat adequate for practice grip on the C++ memory models, and I understand what it means in terms of semantics of the program expressed in the source language.

What I don’t understand is the memory model provided by ISA, and how exactly compiler connects the two. So I am looking for something that would explain what x86 does, what arm does, and what llvm does with all of that.