Without the futex, it's futile

32 points by eatonphil

If you’re going to talk about futex, you should also look at FreeBSD’s _umtx_op, which is a quite different point in the design space. There are two important and quite separable aspects to a futex.

The first is the ability to do the fast path without any scheduler interaction. This is really important for a lot of places where the uncontended path is common. There is some disagreement over whether this is actually important: some people argue that a mostly uncontended lock is a sign that you are using the wrong primitive (if it’s uncontended, a lock-free data structure might be a better option). I’m not really convinced by that argument. With something like a futex, you can do the fast path entirely with atomic operations and only do the system call if you are likely to need to sleep (you can’t guarantee that you will sleep, another thread may remove the condition that requires blocking on your way into the futex wait call).

The other aspect of the futex is to provide a generic primitive. The wait call is an atomic compare and wait. You can use it fairly easily to turn a spinlock into a sleepable mutex, but you can use it in a load of other places as well. For example, if you want a single-producer, single-consumer ring buffer, you can use a pair of free-running counters for the producer and consumer, with the free space identified by subtracting one from the other, If the queue is not full, the producer can make progress, if the queue is not empty, the consumer can make progress. If the queue is full, the producer can do a futex wait on the consumer counter to be notified that it can proceed. The consumer needs only to do a futex wake when the queue transitions from full to non-full, otherwise it’s just doing an atomic increment.

The downside of a futex is that some primitives are hard to implement in a single word. The _umtx_op semaphores and condition variables are much simpler to use than the futex equivalents, at the expense of generality. In Linux, futex has grown a load of special cases while retaining the property that it operates on a single 32-bit value. For example, it implements priority inheritance by having the kernel know that the low bits contain the thread ID of the thread to boost, which removes some generality. It also has some custom support do an exciting structure where a 32-bit word is 32 locks that can be acquired atomically as a set (which has some nice properties for avoiding lock-order reversals, but some terrible properties for caches).

I implemented futexes for CHERIoT RTOS because the scheduler is intended to be simple, and so moving all of the code for locks except an atomic compare and sleep primitive out is useful. Futexes are the only blocking primitive that we expose, we map interrupts to counters that are incremented when the interrupt fires, so you wait for hardware and software events in exactly the same way. We also have a multiwaiter API to wait for more than one futex at a time, so you can multiplex hardware and software events into the same blocking / polling call.

matklad

Allow me to hijack this for an ask: what is a good resource to learn about hardware memory models? I have a somewhat adequate for practice grip on the C++ memory models, and I understand what it means in terms of semantics of the program expressed in the source language.

What I don’t understand is the memory model provided by ISA, and how exactly compiler connects the two. So I am looking for something that would explain what x86 does, what arm does, and what llvm does with all of that.

pervognsen

The relevant papers by the Cambridge group are probably the best resource: https://www.cl.cam.ac.uk/~pes20/papers/topics.html#relaxed_all. As entry points, you could start with http://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pdf on ARMv8’s MCA model and http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf on x86’s TSO model. As for how the LLVM connects the two, I don’t think you’re going to find much specifically on LLVM. But there’s https://www.cl.cam.ac.uk/~pes20/CompCertTSO/doc/paper-long.pdf on a version of CompCert targeting x86-TSO.
- matklad
  
  Thanks, the x86 paper was a great read!
  - matklad
    
    In particular, this was the major conceptual leap for me:
    
    For a processor architecture, we prefer to define a memory model that is applicable to arbitrary programs, to support reasoning about low-level code, and have results about well-behaved programs as theorems above it.
    
    This makes the entire thing sooo much clearer! When we talk about programming languages, we ascribe semantics only to data-race free programs, leaving everything else UB. But for the CPU we we ascribe semantics to every program!
    
    We really are doing different things when describing memory model of CPU vs language!
    
    pervognsen
    
    As a nitpick, there are still lots of things in an architectural specification that can be (and usually are) left undefined in some sense. But you’re right that the kind of undefined behavior found in C is absent.
    
    Regarding how best to understand C-style UB, I appreciate the approach you see explained in this old paper on CompCert (and which corresponds to how UB works in an executable specification like MiniRust): http://gallium.inria.fr/~xleroy/publi/compcert-backend.pdf. That is, UB is a perfectly well-defined, observable, terminating event in the language’s abstract machine or source-level semantics. What’s special about the C or Rust style of UB is that we only require correct language implementations (i.e. compilers and interpreters) to be semantics-preserving for “good” behaviors, i.e. behaviors that don’t lead to a UB event in the abstract machine execution. All of this is explained in a self-contained way in the first few pages of the paper, and I found it helpful to clear up some conceptual explanation I had when I first read the paper years ago.
    
    matklad
    
    That is, UB is a perfectly well-defined, observable, terminating event in the language’s abstract machine or source-level semantics. What’s special about the C or Rust style of UB is that we only require correct language implementations (i.e. compilers and interpreters) to be semantics-preserving for “good” behaviors
    
    Oh, but this is how I always explain UB!
    
    https://lobste.rs/s/omasxh/there_is_no_memory_safety_without_thread#c_1cvrb0
    
    It is useful to now have a paper to point to, thanks!
    
    orib
    
    The best I’ve seen is here: http://www.ai.mit.edu/projects/aries/papers/consistency/computer_29_12_dec1996_p66.pdf
    
    fanf
    
    I’m suspicious of papers about memory consistency that date from before the C11 memory model, and especially those that predate Boehm’s “threads cannot be implemented as a library”. However based on a brief look that paper seems to be a very good summary of the state of the art in the 1990s, which is roughly what the Linux kernel memory model is based on.
    
    orib
    
    The value of this paper is that it gives a hardware based intuition for what optimizations lead to the various memory models. Since the paper was published, things have largely gotten stricter, and hardware has gotten more principled.
    
    It’s certainly not the complete story, but it’s still one of the better introductions. It provides a good framework on which to hang the information in your processor reference manuals.
    
    matklad
    
    Thanks, that’s indeed an excellent read, and exactly the genre I am asking for (but, yes, tells only part of the story and leaves one to wonder where are we 30 years later).
    
    Moonchild
    
    per links are good, i also rather like this talk by olivier giroux which i think gives some really good perspective kind of at the hw-sw intersection
    
    Moonchild
    
    couple of other links that may be of interest
    
    https://people.mpi-sws.org/~viktor/papers/popl2015-c11comp.pdf
    
    https://github.com/weakmemory/imm (not read or looked closely at but it is from the usual suspects so i assume it is legit)
    
    the short answer about what llvm does with concurrent operations is not fundamentally all that different from what it does with any other operations—c++ has a semantics, llvm ir has a semantics, and the various machines codes have their semantics, and you have various semantics-preserving transformations between and within those. concurrency is certainly not the only area where the specs are nondeterministic. they do tend to ‘tread lightly’ on concurrent code as far as optimisations are concerned, because it is so touchy. despite all the hubbub, llvm doesn’t really think about it much more than it would about these snippets wrapped in __asm__ blocks; the important parts are just that it 1) passes them through in a particular guaranteed way that everybody can agree on and 2) promises not to break them.
    
    matklad
    
    Thanks a lot for this explanation, I think it actually closes the last gap I had!
    
    it is from the usual suspects
    
    Ha, lol, I studied with Anton Podkopaev at the Uni!
    
    abareplace
    
    See these blog posts:
    
    https://preshing.com/20120930/weak-vs-strong-memory-models/
    
    https://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu/
    
    https://preshing.com/20120625/memory-ordering-at-compile-time/
    
    https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
  - ChrisDenton
    
    I’d just add the note that on Windows the futex functions (WaitOnAddress, WakeByAddressSingle, WaitByAddressAll) aren’t the lowest level primitive and are themselves implemented in terms of the lower-level NtWaitForAlertByThreadId and NtAlertThreadByThreadIdEx (i.e. park thread and unpark thread). That said, unfortunately these aren’t documented so shouldn’t be used directly but they are used by the OS to implement other types of synchronisation.
    
    david_chisnall
    
    That’s very interesting and explains why the Windows version is process local. Most operating systems have global and process-local models. Typically, when you do the wait call, you have to acquire a kernel lock that serialises wakes on the corresponding memory location, then to the compare, put the thread on a sleep queue, and release the lock. Making this process local is nice for two reasons. First, the contention on the lock is simpler. Second, the lock key can be the virtual address, whereas for futexes in shared memory it must be a physical address (and must either pin the page or handle page-out / compaction in a special way).
    
    If it’s implemented on a simple wake-thread primitive, then all of this must happen inside the process, so is intrinsically tied to the process.
    
    This is a shame because futex-like operations are very useful between processes.
  - kprotty
    
    In the “Minimizing system calls” section, it mentions:
    
    The second case leads to a spurious wake-up. The woken thread checks the futex again, sees it’s locked, and goes back to sleep, but that’s two extra system calls.
    
    Its really ~n extra syscalls rather than two, failing for most waiters after a new one comes in or an old one leaves when acquiring. Under load, this generates ~max(waiters - 1) spurious wakes for every lock cycle and degrades into a spinlock, at least anecdotally.
    
    Could instead do what musl does: moving the counter to a separate u32 and having the futex word be the lock state. But most mutexes built on futex (glibc/ntpl, rust, zig) use a 3-state futex word (unlocked, locked, waiting), where a waiter acquires via unlocked -> waiting. Minimizes to only an extra wake once theres no longer any (new) waiters.
    
    viega
    
    I’m not sure I understand; why would it be n threads waking? Only one gets woken, sees the mutex is still locked, and goes back to sleep. When the new lock holder unlocks, it wakes up only one more thread.
    
    kprotty
    
    “after a new one comes in”
    
    Thread 0 fails to acquire, bumps wait count to 1, and tries to wait
    
    Thread 1 fails to acquire, bumps wait count to 2, and tries to wait. Thread 0 spuriously wakes (fails to wait) as wait count is now 2 instead of 1. It tries to wait again with count 2
    
    Thread 2 fails to acquire, same dance. Threads 0 and 1 spuriously wake for the same reason, then try to wait again with count 3
    
    “or an old one leaves”
    
    lock released & wakes Thread 2, who acquires it and wants to dec wait count
    
    Thread 3, 4, 5 fail to acquire, bump wait count, etc.
    
    Thread 2 decrements wait count, spuriously waking Thread 3 & 5
    
    viega
    
    The wake count is only used to determine whether a futex wake is done on unlock. Threads that are awoken re-attempt to acquire the lock, but the lock itself is still only one bit.
    
    So unless you wake multiple threads when unlocking, that does not happen.
    
    fanf
    
    A futex doesn’t wake threads whenever it changes: one of the active threads must explicitly ask the kernel to wake a thread, and you aren’t going to write your mutex so that it wakes threads when a thread needs to put itself to sleep.
    
    kprotty
    
    futex_wait takes a ptr & expected value. After acquiring its internal lock, it compares the data at the ptr to the expected value. If it differs, then it bails on suspending the thread. Here, the data at the ptr changes from expected on each wait_count inc & dec, thus causing a bunch of “spurious wakes” re:waiting.
    
    viega
    
    A single thread trying to go to sleep can end up contending for the lock because they don’t have the right wait count, and that is a reason to move the wake count into its own space (I do that in my own mutex that I actually use day to day).
    
    So that I get, but that doesn’t seem to be your main point– you seem to be saying multiple threads will awake? Each unlock of the mutex results in AT MOST 1 futex wake operation, and such a call only wakes one waiter the way it’s called in the article.
    
    fanf
    
    The expected value is used to protect against a sleep/wakeup race when a thread is trying to sleep on a contended futex. It can prevent the thread from going to sleep when it loses the race, but it doesn’t cause any other threads to wake up. There are different futex ops for sleeping and waking, and a sleep op doesn’t cause wakeups.
    
    kprotty
    
    Preventing the thread from going to sleep here is considered a spurious wake (return from futex_wait without a corresponding futex_wake). The update of the counter from 1 thread here can cause other threads to “spurious wake” when trying to sleep with another expected count, as noted in the example.
    
    fanf
    
    Oh I see! It wasn’t clear from your description whether the waits succeeded or not. I thought you meant that thread 0 actually went to sleep then was spuriously woken up when thread 1 arrived (etc.). I didn’t realise “fails to wait” was supposed to mean “loses the race”, I thought it meant “fails to continue waiting”.
    
    viega
    
    It’s definitely confusing to call it a wake when the OS isn’t putting you to sleep, but you are making the system call, so it does make sense. Thanks again!
    
    tobin_baker
    
    Came to this thread late, but here are some lesser-known but good recent textbooks on shared-memory concurrency:
    
    Hardware: A Primer on Memory Consistency and Cache Coherence
    
    Software: Shared-Memory Synchronization
    
    the-lazy-guy
    
    I appreciate that author showed how to build fast mutexes from futex primitives (though, Ulrich Drepper did it in 2004 already - https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf), I don’t like their bashing of the book.
    
    There is much more to concurrent programming then how to implement mutexes efficiently - memory models, starvation, live and deadlock, priority inversion, various consistency models, ABA and myriad other things (which I probably never learned). And the book covers all those things quite well. Which is a lot for a course textbook. While building fast synchronization primitives is very important, it is less important than correctnes, so all of the things above should have higher importance than explaining futexes.
    
    Most people writing concurrent software, even pretty highly contested, can get away with whatever synchronization primitives their language and/or system library provides. In my professional life, I have seen only once when using futexes directly (vs pretty standard primitives built on top of them) was justified by performance requirements, rather than developer’s desire to show how cool they are.
    
    P.S. I googled a little, and it seems that prior to futexes, Solaris implemented thread parking with lwp_park/lwp_unpark which are less general than futexes, but also allow to avoid syscalls when waiting is not necessary. Could also cover those vs futex?
    
    mjb
    
    I don’t like their bashing of the book.
    
    It’s a good book, but it is an important oversight. From a systems perspective, being able to separate the tasks of locking and scheduling is super valuable. Being able to pass the scheduling part over to the OS, which is most empowered to deal with it in a multi-process world, without pushing locking into the OS, is almost required to get good performance and multi-core scalability.
    
    Seems like a reasonable criticism to me.
    
    viega
    
    As the author of the article, maybe I spent too much time on the mutex as an example, but it is just one of many examples where the modern world isn’t even mentioned.
    
    I thought it was good to step through one issue it overlooked well, to give people a sense of it, but I did note a few other things I consider significant omissions. So many practical considerations are either obtusely mentioned, or skipped, as is anything that would go too deep under the hood. So who’s the book for?
    
    One high level example: it talks a bit about queueing algorithms, and it mentions that threads don’t need to be 1:1 mapped to a processor. I don’t think the book even acknowledges “Go’s” existence when it talks about mapping threads to processors, and doesn’t say anything particularly meaningful about tradeoffs.
    
    To that end, even if async primitives are used essentially as green threads in many environments, it’s still a concurrency approach that works just fine in a multi-processor world, and there’s… no mention.
    
    I also will say, if there’s a reference to performant IO multiplexing (something like epoll(), not even expecting something more modern like io_uring), I don’t remember it. And searching for the words “poll” and “epoll” both return 0 hits.
    
    And finally, as someone else here said, the concerns about atomics (memory ordering) aren’t discussed in any way that’s practically relevant. I am befuddled by some of the choices they make with non-blocking algorithms too.
    
    To me, some of those points I can see arguing. But the futex is currently so fundamental to modern concurrency, the fact that, in a recent book, it doesn’t even merit a mention anywhere, despite being in every major OS for a decade or more.
    
    It’s indicative of the overall problems, without having to drill into every little thing I didn’t like.
    
    the-lazy-guy
    
    I think omission of Go, i/o multiplexing and async are completely fine and justified.
    
    Go is not special with it’s green threads, that is just a mechanism to implement context-switching in user-space (to be able to integrate with i/o event-loop and have somewhat faster channels). It actually does not invalidate any of techniques described in the book. Go tries really hard to pretend that goroutines are indistinguishible from real threads, as long as you don’t call external code or invoke any syscalls. Java with its project Loom is veryc lose.
    
    I agree that async/await is a concurrency paradigm. But it is a form of cooperative concurrency (since you choose when to yield) which is a completely different beast from multithreaded/multiprocessor concurrency and pretty much nothing described in the book is relevant in that context. async/await as a concept is actually younger than the first edition of this book (I didn’t read the second). Maybe if they wrote book today, they would have called it “Art of multithreaded programming”, to distinguish it. Maybe not.
    
    I/O multiplexing is orthogonal to concurrency. You can combine them, but you don’t have to. You can have perfectly sequential program using (e)poll or io_uring. Moreover, most concurrent frameworks (including aforementioned Go and languagues with async/await) try to hide that i/o is multiplexed, giving you an illusion of blocking i/o. In this sense it is like futexes - an important building block that is completely hidden by useful abstractions on top.
    
    3a. Though, both go and async/await frameworks give you some form of select operation for multiplexing beyond just i/o. That is something valuable that the book doesn’t discuss. (“Concurrent programming in ML” - does, it is buit entirely on CSP paradigm)
    
    I think you overstate importance of futexes. Especially for application programmers. It is like saying that university textbook on software design that doesn’t explain how modern thread-caching allocators work internally is bad. Thread caching allocators are fundamental to modern software (whether it’s good or bad - debatable), but you can write massive amount of code just knowing relevant APIs and some general guidance.
    
    UPD:
    
    But the futex is currently so fundamental to modern concurrency, the fact that, in a recent book, it doesn’t even merit a mention anywhere, despite being in every major OS for a decade or more.
    
    Maybe I will agree with you (and matklad) in that book is less adequate in 2020 than it was in 2008. 2nd edition looks like light brush up without any new content. But neither in 2008 nor now it was intended as a comprehensive all-around handbook on concurrent programming. It is first and foremost an introductory text on concurrent programming for university students.
    
    matklad
    
    It is first and foremost an introductory text on concurrent programming for university students.
    
    I this was my university introductory text around 2013! I can confidently say that it sucked in this role :-) it’s a very long book, with relatively low density of key insights. It’s a good read if you have nothing better to do, but it’s definitely possible to write much shorter, much more useful text (though, I don’t know of such a text). After reading that book, I felt that I’ve learned something, but not much, and was confused as to what I’ve actually learned. It was only years later, when I learned everything I haden’t got from the art, that I understood just how underwhelming the book is.
    
    It’s starts with “fundamentals”, and while some ideas there are indeed key (sequential consistency, linearizability, lock/wait/obstruction freedom, quiescence), a lot feels like needless filler, like that whole stuff about registers.
    
    And than the bulk of the book is “practice”, which is mostly a laundry list of classical and obscure-seeming data structures, which you sort of just glaze over, if this is your introductory text. I did learn about ABA, and about “helping” for lock free algorithms from this (one thread finishes another’s operation), but that’s all I remember now from those 300 pages?
    
    In contrast, there’s a whole bunch of ideas I’ve learned elsewhere:
    
    memory models! Grocked that from https://shipilev.net/talks/narnia-2555-jmm-pragmatics-en.pdf and then https://preshing.com/archives/.
    
    that at some point you need to tell the kernel “hey, I am waiting” I don’t remember when I understood that the kernel actually needs to be involved, it was not long before https://matklad.github.io/2020/01/02/spinlocks-considered-harmful I think. This is absolutely major part of the mental model, that you need to cooperate with the scheduler, that you can be runnable or sleeping. And I don’t think the Art teaches about that at all? So I was surprised that you mentioned priority inversion, because you can’t teach about priority inversion if you don’t explain the role of the kernel and the scheduler. And the art doesn’t explain inversion! It is mentioned once, at the end, in the intro section for transactional memory, as an argument for why locking sucks, which would be absolutely impenetrable to the reader who is new to the topic.
    
    Really, memory model + waiting (=futex) is the entire basis of the whole field, and the book teaches neither.
    
    Depth vs total work. I got that from reading Cilk materials: like when you parallelize merge sort, you get a bit of speed up, because merge is sequential bottleneck, but then you make merge parallel, increasing the total amount of work, but unlocking parallelism and finishing faster. That’s absolutely brilliant central result that explains how to do parallel programming (and programming in general, it’s the most important algorithm for managers), and, it’s not exactly misssing from the book, they discuss depth/total work in the work-stealing chapter, but they use fibonacci example, of all the things! The sorting discusses sorting networks!
    
    The concurrency-not-parallelism and IO. Maybe it’s out of scope for the book that discusses primarily parallelism, but, if that’s an intro book, it should discuss both, otherwise the student will be confused.
    
    And, from a university book, I’d definitely expect to see the holistic picture of devices+kernle+userspace. I feel like I still am missing this one, it was only last year that I realized that the kernel level IO operations are blocking (your thread blocks and switches away while it’s in the middle of kernel stack)
    
    cliff click concurrent hash map (https://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/)? Feels like the one non-stack/non-queue data structure to know! (Though, ok, this one was concurrent with the book)
    
    what I asked in other comment: how hardware MESI + ISA + Compiler + source language memory model actually connect to each other. Still waiting for a book about his one!
    
    the-lazy-guy
    
    Thank you. This in my opinion is a much better criticism of the book.
    
    This book was not used as introductory text in my university. I read it on my own and it was much better than the course on the topic that we had. “Fundamentals” is the best part of the book. Though, I found datastructures and section also useful, because they did demonstrate how to put things described in “fundamentals” to some practical use. I was aware how to use coarse locking before reading the book, but it did show to me fine-grained locking for the first time, and I didn’t see lock-free datastructures before that. (btw, they do have overly-complicated lock-free hash table example in the book; giving example of non-resizable lock-free table first like Cliff Click’s would have been better)
    
    I think better introductory texts just do not exist. You were left confused after reading this book, because concurrent programming is hard and to understand anything one has to read probably 10x relative to the book and to actually write (and debug!) some real code. I am still confused about many things.
    
    I did not really understand C++ memory model until I had bugs because of incorrecty placed acquire/release accesses. I probably still don’t really understand it beyond simple patterns.
    
    On the holistic approach. As I said earlier, I do not expect from university book on C++ or datastructures a holistic view of how thread-caching allocator work all the way down to mmap and setup of page tables in the kernel. While it is fundamental for anything using heap, we can always abstract it behind new/delete API. Same way the book in pracical part stops at the 2 abstractions from java.util.concurrent - mutex and condvar. It shows that they are sufficient to build anything - giving examples like reentrant mutexes or rwlocks (but IIRC mentions existence of better implementations in standard library).
    
    PS. On coverage of priority inversion in the book: my memory must have lied to me. I probably learned this concept from other sources around same tme.
    
    viega
    
    He and I are definitely making the same points. I feel I was very clear in the article that it is unclear, given it’s a senior/graduate text, whether they’re targeting future practitioners or future academics because they fail at both. But it does seem to be trying to address both.
    
    Academic books like this usually are expected to sketch the key concepts underpinning the world. This kind of intro should be peeling back every curtain long enough for people to get a glimpse, to see if it’s something they’re interested in exploring more. What’s it preparing people for?? As matklad illustrated, if you’re going into the real world, not much, perhaps “confusion”.
    
    The futex is one of many examples of where it’s inexcusable for any intro book with such a recent copyright date to not at least mention it. There is almost nothing outside that copyright date that would lead me to believe this book was even written in this millennium.
    
    And it’s one thing to say for things like async “well, that’s not multiprocessor, so it doesn’t need to be covered”, but to not even help people understand about the overlapping landscape? It’s not even just async, on the other side too– they totally ignore that distributed systems is a field at all. Obviously, I wouldn’t necessarily expect them to go deep into consensus algorithms, but knowing how the fields are related, and sketching how they are different seems like it should be important to cover.
    
    fanf
    
    The book has three chapters on how to implement various kinds of mutex, so it’s reasonable to expect it to show how to do so using modern OS primitives.
    
    viega
    
    If the book is for aspiring application programmers, then the things you say are not important become far more important to teach. If it’s not, then which building blocks work well, and which ones don’t should be there. If the book was aiming at the practitioner, and not attempting to cover any fundamentals, then I could get behind skipping the futex.
    
    However, the first third of the book is all about fundamentals, and it certainly isn’t doing a good job if it’s aimed at practitioners.
    
    As for Go, while the pieces weren’t novel, the assembly made it so. Usually green threads map 1:1 or n:1; having such a well supported m:n environment is a really interesting case study. That’s before talking about channels, where I think the subtle API problems are pretty informative.
    
    Honestly, I think the book should have left the authors’ obvious comfort w/ Java, and moved most of the code to Go. That along would have made it 100% better.
    
    As for async/await, the whole notion of doing that kind of multiplexing is much, much older… older than me even (mid-1960’s or so). A search of the book for ‘coroutine’ turns up zero hits.
    
    And knowing how to work with IO in a multi-processor environment seems like such a fundamental thing, you’d think they’d want to cover it, both for end application developers to (for instance) give them an inkling why their local pipe to a subcommand deadlocked, and to help the aspiring academics and owners of the primitives understand the challenges users will face, along with tradeoffs. Just the whole jump in performance moving away from the inefficiency of select() for large applications is pretty crucial, and yet I’m sure most engineers don’t really have much of an inkling of what that would even be.
    
    the-lazy-guy
    
    Honestly, I think the book should have left the authors’ obvious comfort w/ Java, and moved most of the code to Go. That along would have made it 100% better.
    
    How exactly? What would change apart from syntax? While I agree that go would work equally well (apart from not having formal memory model), I don’t see what it would improve.
    
    If the book is for aspiring application programmers, then the things you say are not important become far more important to teach.
    
    Could you clarify which? I said futex api is less important to teach in a university textbook. I gave my list of things which are important to understand, before going for efficiency of implementation.
    
    As for Go, while the pieces weren’t novel, the assembly made it so. Usually green threads map 1:1 or n:1; having such a well supported m:n environment is a really interesting case study. That’s before talking about channels, where I think the subtle API problems are pretty informative.
    
    What exactly would you like to study about Go case? The book is actually not about Java or Java APIs. Java is used as a common teaching language, and bare minimum of it described to be able to demonstrate novel concepts. Which novel concept relative to Java does Go enable? I agreed on channels (and generally select operation for multiplexing; which is doable in java via 3rd party libraries, but certainly looks ugly). But for that style of CSP-like concurrent programming there’s another great text book - “Concurrent programming in ML”. Which is IIRC about as thick as the book we discuss now. I think, they better stay as 2 separate books.
    
    As for async/await, the whole notion of doing that kind of multiplexing is much, much older… older than me even (mid-1960’s or so). A search of the book for ‘coroutine’ turns up zero hits.
    
    Using async/await aka stackless coroutines became mainstream only after C# did it around 2012. Before that using stackful coroutines (aka fibers or green threads) was common in imperative languages. FP had its share of continuation and monad-based ways to model context switches, but afaik also resorted to (green)threads in any practicalish implementations.
    
    The book is called “the art of multiprocessor programming”. I just went to read the introduction and they clearly state that book is about concurrency that is enabled by parallel execution on multi-processor systems. So anyway, everything discussed above is clearly out of scope. What is in scope is how to make code running concurrently at the same time on different processors cooperate.
    
    viega
    
    Anything introing core principles aimed at people looking to go into academia or people considering low-level work in the area in industry, then material should be helping people understand how and why things are the way they are, setting the groundwork for deeper exploration, and thus should definitely cover the futuex, because it’s important.
    
    If you’re skipping the futex, that’s fine, but then it’d better cover the core principles one needs to be able to make good decisions in the real world, which should ideally introduce you to all of the kinds problems you are likely to face.
    
    There, this book does particularly poorly. In Cryptography, academic classes tend to focus on the tools in the toolbox, but not leave people well equipped to know what next steps are if they want to actually build houses. But, they still tend to discuss relevant tools and their considerations. But there’s often less important stuff in there for the practitioner at the expense of the big-picture view of how it’s all safely stitched together.
    
    The authors pick topics almost from a hat, but don’t really expose the most important intricacies of what it takes to be effective programming in such an environment. It’s clear from matklad’s post above that is the case.
    
    For example, as I’ve essentially said, it spends a lot of time on making mutexes, but without covering the bits you’d ACTUALLY want to use from the OS to make one, OR the obvious questions like, “for how long should I spin?”
    
    But as another example, they spend an awful lot of time on a bunch of random data structures, both lock-free and w/ locks, without any strong motivation as to why what they’re covering is valuable in any way, especially when many environments are already going to have those primitives built in, ESPECIALLY Java.
    
    Yet, it does not give people a good framework to decide what tools are best for the job. How should one decide whether to use the lock-free dictionary, or the dictionary with a lock-per-bucket? How do you try to get ahead of likely risks while selecting our building with technology, as opposed to once you’ve shipped it?
    
    Those are the kinds of issues you’d expect to be covered well if the book is good for practitioners. This book does NOT cover that stuff well either.
    
    It didn’t occur to me until reading @matklad’s post, but he’s right that much of the book really is, in my words, fluffy filler without much substance. For such a long book, there’s very little real theory, and very little real practice.
    
    the-lazy-guy
    
    I disagree with you on many points, but I have articulated many of those, and I don’t think we’ll find common ground. Also I feel like you ignore half of my questions. It is better we stop now to save us both some time. Thank you for disucussion. me yesterday
    
    viega
    
    Sorry, I do see a few things scanning the thread, which wasn’t intentional, but I think you’re right that we don’t need to spend the time, as we’re both unlikely to change our stance.
    
    I did appreciate the discussion though.
    
    matklad
    
    And the book covers all those things quite well.
    
    It definitely does not describe memory models to any reasonable detail at all. Java memory model is described in appendix briefly. There seems to be nothing about Acquire/Release model, and nothing about memory models in general, besides half a page that declares that they exist.
    
    the-lazy-guy
    
    I read the book around the time it was out and vividly remember it explaining that Java memory model is not sequentially consistent and what subset of operations is linearizable across threads. And did mention that other relaxed memory models exist (so you don’t assume that all enviroments follow Java rules). I think it gives good enough awareness of existence and importance of memory models for correctness. This is a textbook, not a comprehensive manual on all-things-concurrent-programming. I don’t have second edition, but if they removed that bit, it would be weird.
    
    Acquire/release was quite obscure until Hans Boehm made everyone care about it with inclusion into C++11. Before that people cared about wht their hardware and compiler actually did. For my peers that was x86 and gcc. AFAIK In 2012 gcc still had codegen bugs around acquire/release intrinsics (and consequentially - std::atomics) and the only guaranteed way to get correct code was to resolve to inline assembly for architecture of choice. (also in theory getting suboptimal codegen because it did introduce unneccessary barriers for compiler optimization)
    
    matklad
    
    The three-page sketch of JMM in the first edition of the book was adequate for 2008, as we were only figuring stuff out back then. But that’s definitely not at all adequate today, nor for practice (where even Java has AcqRel now https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.AccessMode.html) nor for theory (my understanding is that modern theory is mostly formalization of weak memory models).
    
    the-lazy-guy
    
    You won. Strike memory models out from my list. I still think that the rest is more important than futexes.
    
    viega
    
    Well, like matklad said, there is an appendix on software, where if you go to the Java chapter, there’s a small discussion, that’s not very valuable.
    
    The C/C++ standards have a lot to say on memory order, and the discussion should be very different than in Java. But the appendix’ C++ chapter (there’s no C chapter) doesn’t even mention it.
    
    Modern processors do not make it easy to get sequential consistency, and in fact, in the C/C++ model, last time I checked, the guarantees they’re supposed to provide over Acquire/Release are not achieved. So it’s basically Acquire/Release, but slower. There was a paper on dealing with the problem: https://plv.mpi-sws.org/scfix/full.pdf
    
    All to say, the book definitely doesn’t cover memory ordering well. But, I’ve never seen it covered well, partially because it’s all very complicated.
    
    I also think this was a more minor issue for me, because it’s so poorly understood even by practitioners, but I do agree I find it another of many unfortunate gaps in this book.