Put a ring on it: a lock-free MPMC ring buffer
19 points by viega
Slightly off topic, but I’d like to push back a little on this statement:
observability is generally less important than availability
Being too rigid either way is going to lead you astray, but an anecdotal argument is that I can (usually) fix a broken system I can observe, but without observability the system can get into a state I can’t fix. As a concrete example, monitoring jobs in Google Borg run at a higher priority than production jobs.
Obvious caveats of not all observability data is created equal and ROI all stand, though.
I qualified it on purpose, partially because I can argue either side with real-world examples. Still, I do think the scales tip a bit toward availability, not least because it translates to dollars.
For instance, on the other side, I know systems that have been so highly available for so long that their operators won’t even consider most observability improvements, especially since they tend to result in availability problems. Think about old financial systems at the backbone of commerce; some process quadrillions of dollars a month, and have been in production for longer than most of us have been alive.
And with security, when availability isn’t a big concern, sure, collect all the events, if you’re comfortable adding to your attack surface. But companies with heavy loads don’t run EDR products on those loads if the products are essentially lift-and-shifted from Windows, because in the desktop EDR world, it’s a reasonable assumption that you can capture everything you might want to see.
At scale, it’s not a great assumption, which is why some vendors have moved to eBPF for their Linux solutions. Unfortunately, they will learn the hard way that simply moving data collection to eBPF, and then relying on the OS to drop events, isn’t enough. Under large load, you’re still adding to that load, even though you’re now allowing it to be thrown away.
This is very cool. Can you explain how it’s different from the LMAX Disruptor? They at least seem to rhyme, if they’re not exactly the same.
I’ve never seen that before, and it will take me a while both to get around to it and to digest it, as it seems to be a larger framework. But I’ll definitely get around to it, and when I do I’ll try to let you know.
In a quick scan, I see a ring buffer in there, but it seems it’s used for 1:n or m:1 comms, instead of n:m; though, that could be wrong.
Generally, if it’s natural to use 1:1 ring buffers, you should. If not, and it’s natural to use a 1:n construct, great. But it’ll definitely be slower than the 1:1.
At the end of the day, having an n:m option is a great tool, and I’d probably use it by default if it’s natural to have multiple producers / consumers wanting to give and take. However, I might try to force myself into a 1:n or 1:1 scenario if the volume of data going through it is going to be incredibly large, because, unlike hash tables, any queue is going to go slower when every thread is operating on the same data structure concurrently.
Whereas, with hash tables done right, you can get the performance to scale.
The only little asterisk there: you will take a slight penalty to use a hash table capable of m:n access in a single threaded manner. But it’s not much, and it’s offset by the gains you get with threads, because you can avoid a ton of unnecessary contention in a hash table.
For instance, for a single threaded program on my Mac laptop, a table built explicitly for single threaded use can read about 35.5 million entries a second (if that’s all it’s doing), and write them at 30 million entries a second.
But a good lock-free m:n hash table running in a single thread might typically only do about 25m reads / 20m writes per second; at two threads, I get 50m reads / 34m writes per second. And everything scales well from there until you start bumping into memory bus limits… my laptop tops out around 400m reads per second / 300m writes per second. Given I’ve got 12 performance cores, that’s almost linear per core.
Why are dictionaries so much better? Queues of any kind have challenges that dictionaries do not.
Exponential backoff is a good tool for minimizing the damage, but typically any practical non-1:1 queue design I’ve ever built or used had peak performance that topped out around the single-threaded performance. Maybe occasionally a bit above, but usually within something like 20% of the single-thread performance.
Those bottlenecks that limit queue performance are there, whether you use locks, or go lock-free.
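To make the backoff point concrete, here’s a minimal sketch (my own illustration, not code from the article) of a CAS retry loop with capped exponential backoff; the function name and delay cap are arbitrary choices:

```c
#include <stdatomic.h>

static void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");   /* hint to the CPU that we're spinning */
#else
    ;                                /* portable no-op stand-in */
#endif
}

/* increment with bounded exponential backoff: each failed CAS doubles a
   pause count, so threads that keep losing get progressively more polite
   instead of hammering the contended cache line */
long inc_with_backoff(_Atomic long *ctr) {
    unsigned delay = 1;
    long old = atomic_load_explicit(ctr, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(ctr, &old, old + 1)) {
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        if (delay < 1024)            /* cap so the backoff can't run away */
            delay <<= 1;
        /* a failed CAS reloads `old` for us; just retry */
    }
    return old + 1;
}
```

Backoff reduces the damage under contention, but as noted above, it doesn’t remove the serialization point itself.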
Yes, writers can be given some flexibility to spread out contention with other writers, because you can use a fetch-and-add on an index. That’s still contention, but it’s more benign than most, since it executes with a proper upper bound on time and does not require retries. Unfortunately, after that little trick, you will still contend within your cell with readers trying to dequeue, and in the case of a ring or stack, with other writers as well.
You can do similar things with readers. With an unbounded FIFO, as long as the queue isn’t full, that actually does win you some, since your only potential contention is with the writer who was given the same cell as you. But stacks and rings don’t benefit.
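As an illustration of the fetch-and-add trick and the residual per-cell contention, here’s a hedged sketch of a ticket-based bounded queue (my own simplification, not the article’s implementation; the slot assignment is wait-free, but the spin on the per-cell sequence means the queue as a whole can block):

```c
#include <stdatomic.h>
#include <stddef.h>

#define CAP  8u                      /* power of two */
#define MASK (CAP - 1u)

typedef struct { atomic_size_t seq; int value; } cell_t;

static cell_t cells[CAP];
static atomic_size_t enq_ticket, deq_ticket;

void ring_init(void) {
    for (size_t i = 0; i < CAP; i++)
        atomic_init(&cells[i].seq, i);
    atomic_init(&enq_ticket, 0);
    atomic_init(&deq_ticket, 0);
}

/* producer: one wait-free fetch-and-add hands out a unique slot... */
void ring_push(int v) {
    size_t t = atomic_fetch_add_explicit(&enq_ticket, 1, memory_order_relaxed);
    cell_t *c = &cells[t & MASK];
    /* ...but we still synchronize within the cell: wait until the consumer
       that last used this slot has released it */
    while (atomic_load_explicit(&c->seq, memory_order_acquire) != t)
        ;                            /* spin; real code would back off */
    c->value = v;
    atomic_store_explicit(&c->seq, t + 1, memory_order_release);
}

int ring_pop(void) {
    size_t t = atomic_fetch_add_explicit(&deq_ticket, 1, memory_order_relaxed);
    cell_t *c = &cells[t & MASK];
    while (atomic_load_explicit(&c->seq, memory_order_acquire) != t + 1)
        ;                            /* wait for the matching producer */
    int v = c->value;
    atomic_store_explicit(&c->seq, t + CAP, memory_order_release); /* free slot */
    return v;
}
```

The FAA gives each thread its own cell with no retries; all remaining contention is on the per-cell sequence word, exactly the residual contention described above.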
Contrast that to a dictionary:
Even when you resize a dictionary, it doesn’t need to add significant write contention at all.
I guess I should break some of my dictionary code out soon (I have a few different ones that are optimized for different trade-offs).
This insight that some data structures are inherently unscalable can be formalized: https://files.sri.inf.ethz.ch/website/papers/popl168gf-attiya.pdf.
That’s a good paper, thanks for posting it.
On first glance, I’m not reading it as directly formalizing unscalability. The result I see is, more or less, if you need to linearize a modification across multiple threads, synchronization of some kind is required.
But to turn that into how unscalable an algorithm is depends on what atomic operations are your synchronization points, and how likely is contention on those points.
For instance, they can prove that an add() operation on a hash table requires synchronization, which is intuitively correct. But with a hash table, synchronization isn’t too hard to scale, because, for the most part, you can push the synchronization all the way into the bucket. Given that a hash table with a good hash function will randomly distribute access across the buckets, the real-world contention drops off VERY quickly.
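To illustrate pushing the synchronization into the bucket, here’s a toy insert-only lock-free hash set (my own illustration, not anyone’s production table: fixed size, no resize, keys must be nonzero, and it assumes the table never fills):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NBUCKETS 64u                 /* power of two, fixed for the sketch */

static _Atomic uint64_t table[NBUCKETS];   /* 0 = empty bucket */

/* splitmix64 finalizer as the mix function */
static uint64_t mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* add(): the only synchronization point is one CAS on one bucket.  With a
   good hash, concurrent writers rarely land on the same bucket, so that
   CAS is almost always uncontended. Returns 1 if inserted, 0 if present. */
int ht_add(uint64_t key) {
    for (uint64_t i = mix(key); ; i++) {
        _Atomic uint64_t *b = &table[i & (NBUCKETS - 1)];
        uint64_t expect = 0;
        if (atomic_compare_exchange_strong(b, &expect, key))
            return 1;                /* claimed the empty bucket */
        if (expect == key)
            return 0;                /* already present */
        /* another key owns this bucket: linear-probe to the next one */
    }
}

int ht_contains(uint64_t key) {
    for (uint64_t i = mix(key); ; i++) {
        uint64_t k = atomic_load(&table[i & (NBUCKETS - 1)]);
        if (k == key) return 1;
        if (k == 0)   return 0;      /* empty slot: not present */
    }
}
```

The synchronization the paper requires is still there; it has just been diluted across 64 independent words instead of concentrated on one.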
But, you could have your one synchronization point be a table-wide lock. That will kill your performance very quickly.
The paper argues for more effort put into speeding up atomic primitives like compare-and-swap. But an uncontended compare-and-swap at the bucket level isn’t worth worrying about.
Realistically, the paper doesn’t give a direct bound on any particular data structure. You still need to analyze those synchronization points and the probability of contention, which wouldn’t just include, “odds of us competing”, but would have to take into account the length of the critical section.
The Disruptor has an M:N configuration (in fact :1 is just a specialisation of :N for any producer count). Back when I worked on it, the Disruptor did not have the ability to drop old messages; it would just fail to enqueue a new one. This was handled by periodically checking all consumer sequence numbers, to make sure that they had all consumed past the point where a new append would wrap over old data.
Interesting, thanks. That sounds like the one paper I did find after the fact, where it didn’t really match my needs for a ring buffer to shed old data. Definitely will check it all out when I get some time.
Two questions:
How would you implement batch enqueue/dequeue (it looks like you could just attempt to increment tickets by a number greater than 1)? The real magic of a circular buffer is that you can write up to capacity - committed bytes in a single transaction and read up to committed, which is important for variably sized data moving between producers/consumers that are more or less matching pace with each other.
What about broadcast? Sometimes you want every consumer to be delivered every message, or for multiple consumers to subscribe to the same producers (eg: pub/sub). It looks like this scheme is easier to adapt than a classic FIFO, which usually has a separate message broker task, but I’m not sure.
that’s why many data structures are backed by arrays sized to powers of two. It’s such a common paradigm, we’ll just roll with it.
I was taught this trick was an optimization, but I found it’s actually a correctness thing. If the modulus is not a divisor of the counter’s period (max value plus one), then upon overflow it will be incorrect (say the capacity was 7 and you had an 8-bit counter: 255 % 7 = 3, so when you increment and overflow to 0, you jump from slot 3 to slot 0 and skip slots in the buffer). This is one of those times where you rely on the fact that unsigned integer overflow is defined to wrap.
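A tiny demonstration of that wraparound bug, using the same capacity-7 / 8-bit-counter example (the helper name is mine):

```c
#include <stdint.h>

/* advance an 8-bit counter once and report the ring index for a given
   capacity; exposes what happens when the counter wraps from 255 to 0 */
unsigned next_index(uint8_t counter, unsigned capacity) {
    counter++;                 /* unsigned overflow is defined to wrap */
    return counter % capacity;
}
```

With capacity 7 the index jumps from 3 to 0 at the wrap instead of advancing to 4; with capacity 8 the wrap lands exactly where a normal increment would.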
Great questions.
For a ring buffer yes, you can absolutely extend to fetch-and-add your way into a larger number of chunks at once. Though, with the arbitrarily sized chunks, you essentially write the data first, then get ordered by when the pointer is enqueued. That would still work here, if you link the chunks you take together. The two-tier structure is still better for arbitrary lengths, so you don’t have to CAS into every single word in the ring you want to write to.
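The batch-claim part is small enough to sketch (illustrative only; the caller still has to fill and publish each claimed cell, or each claimed byte in a byte-oriented ring):

```c
#include <stdatomic.h>
#include <stddef.h>

/* one fetch-and-add reserves n contiguous tickets at once; the caller
   then owns slots first .. first+n-1 */
size_t batch_claim(atomic_size_t *ticket, size_t n) {
    return atomic_fetch_add_explicit(ticket, n, memory_order_relaxed);
}
```

Claiming in batches amortizes the one contended atomic over n slots, which is most of the win for variably sized data.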
For broadcast, you can either go with a m:1 approach (m threads enqueue messages for 1 thread in a dedicated ring), or never have dequeues at all. Instead, you could just announce an epoch plus index, and interested parties look as soon as they can; they’ll know if that message dropped. I’d do the announcement in whatever way makes sense, e.g., could be a write to a set of fds.
For that use case, if you’re never worried about starving readers, and only dequeue by overwriting, the approach totally would work, but I’d change how it deals with easing contention. Basically, writers would only be contending with other writers, so if they fail a few CAS ops trying to enqueue, they can do their own exponential backoff proactively. Or you could achieve the same thing in the other direction, by having them set a help bit so that other writers drop what they’re doing to help us ensure we make progress.
Usually, the former approach does better for me; it’s easier to implement and tends to be easier to get performing well, but YMMV.
That might be a fun thing for me to build sometime soon, thanks for bringing this up.
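Here’s a rough sketch of the epoch-announcement idea, simplified to a single producer (my own toy, not the article’s ring; the per-slot epoch doubles as a seqlock so a late reader can tell its message was dropped):

```c
#include <stdatomic.h>
#include <stdint.h>

#define BCAP 8u                      /* slots in the broadcast ring */

typedef struct { _Atomic uint64_t epoch; _Atomic int value; } slot_t;

static slot_t ring[BCAP];
static _Atomic uint64_t next_epoch = 1;   /* epoch 0 = "never written" */

/* single producer: publish overwrites the oldest slot and never blocks.
   Returns the epoch, which you'd announce however makes sense (fds, etc.) */
uint64_t publish(int v) {
    uint64_t e = atomic_load_explicit(&next_epoch, memory_order_relaxed);
    slot_t *s = &ring[e % BCAP];
    atomic_store_explicit(&s->epoch, 0, memory_order_relaxed);  /* invalidate */
    atomic_store_explicit(&s->value, v, memory_order_relaxed);
    atomic_store_explicit(&s->epoch, e, memory_order_release);  /* re-publish */
    atomic_store_explicit(&next_epoch, e + 1, memory_order_release);
    return e;
}

/* consumer: returns 1 and fills *out if epoch e is still in the ring,
   0 if that message has already been overwritten (i.e., it was dropped) */
int try_read(uint64_t e, int *out) {
    slot_t *s = &ring[e % BCAP];
    if (atomic_load_explicit(&s->epoch, memory_order_acquire) != e)
        return 0;
    int v = atomic_load_explicit(&s->value, memory_order_relaxed);
    /* re-check in case the producer lapped us mid-read */
    if (atomic_load_explicit(&s->epoch, memory_order_acquire) != e)
        return 0;
    *out = v;
    return 1;
}
```

Every interested consumer can read every epoch, and a reader that arrives after a lap simply learns the message is gone, which is the shedding behavior discussed above.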
As for the modular arithmetic, it’s true that the overflow leads to a bias. That is definitely worth worrying about for 32-bit values, and if you put that in the context of, say, a hash table with 999 buckets, you’re probably right in terms of that being the reason people originally recommended powers of two.
However, for a ring where indexes are ordered via a sequential epoch, not many systems will ever get that counter anywhere near the top of the space to realize the bias. So the non-power-of-two size isn’t really an issue there in practice.
I tend to stick with power-of-two sizes anyway; it’s easier in a number of little ways. And that’s really somewhat separate from what I was saying: I could keep the power-of-two size, yet switch to % for clarity. I’m not sure there’s any speed reason any more, even if a compiler can’t figure out that it could strength-reduce, or at least, not a significant one.
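For what it’s worth, the mask and the % really are interchangeable for power-of-two sizes, and only those (the helper is my own illustration):

```c
#include <stdint.h>

/* returns 1 iff (i & (cap - 1)) and (i % cap) agree for all i in [0, n) */
int mask_matches_mod(uint64_t cap, uint64_t n) {
    for (uint64_t i = 0; i < n; i++)
        if ((i & (cap - 1)) != (i % cap))
            return 0;
    return 1;
}
```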
Still, I doubt I’ll bother to ever change that :)
The x86 family has long supported a 128-bit compare-and-swap, but until recently, it required instruction-level locking to use, because it did not support atomically loading 128 bits into a register otherwise. So on old hardware, you’re technically using a lock with a 128-bit CAS, but 🤷‍♂️
i’m not quite sure what this is supposed to mean, but i don’t think it’s right. most x86 hardware (not quite all, but most; i believe windows 8 requires it, for instance) supports cmpxchg16b, which is a 128-bit compare-and-swap, and it does not have significantly different performance characteristics from other atomic memory operations. yadda yadda avx atomic reads (/waitfree atomic writes) and libatomic software finagling and whatnot, but you have in fact for a long time been able to write nonblocking algorithms for x86 with 128-bit atomics, and not much has changed in terms of hardware support for that in a long time
Check out the Intel x86 instruction reference. You will see:
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)
Until the very recent past (it might have just been a few weeks ago), all their x86 hardware could do the atomic swap, but otherwise it basically loaded each 64-bit word non-atomically. The lock prefix is an actual instruction-level lock, and if you’re writing in asm and not issuing the lock, you’re not getting atomicity.
At a high level, you’re correct: in order to get an architecturally (n.b., this is a stronger requirement than reality) guaranteed atomic 16-byte load, we had to use LOCK CMPXCHG16B. Nowadays, Intel and AMD have both blessed aligned 16B vector loads and stores as atomic.
However, the LOCK prefix has nothing to do with lock-freedom. It simply says that you want the operation to happen atomically, and with strong ordering (flush the store buffer in the TSO model). In practice, on contemporary x86 (except for locked operations that span two cache lines), this happens with the same message-passing memory coherence mechanism as regular loads and stores: the core gets exclusive ownership of the cache line, and lets other cores snoop it once the LOCKed read-modify-write has completed (or rolls back the whole RMW).
The fact that memory-register XCHG doesn’t need a lock prefix doesn’t mean anything except that Intel really values backward compatibility (with multi-socket 286 systems in this case…).
TL;DR: if one considers LOCK-prefixed instructions to violate lock freedom on x86, so do all memory loads and stores.
Yes, I 100% agree with everything there; it’s still a lock-free algorithm, and the hardware providing atomicity and a fixed upper bound on execution time makes it just as good as if 128-bit values were atomically loaded into registers. The progress guarantee is different from how the black box is implemented.
Thanks!
not really sure what you’re talking about. the section you quote is written slightly anachronistically, but it’s basically getting at something that applies to single-word cas as well on basically every processor (you can see the same verbiage in the section about regular cmpxchg): the location being cased is acquired exclusively under the cache coherency protocol, so concurrent failing cases will still contend with each other. there is not really any such thing as a ‘bus level lock’, nor has there been in an extremely long time; processors will use a cache coherency protocol such as mesi (usually a more advanced version), and the ‘lock’ terminology in x86 is an anachronism
it is true that an algorithm written for x86 using 128-bit cas will often start by doing a pair of 64-bit reads. that’s because there’s no native fast (in terms of constant factors) instruction for doing a 128-bit read; but that has nothing to do with cmpxchg16b, which is a perfectly competent atomic 128-bit cas. and you even can use it to do an atomic 128-bit read (just cas from 0,0 to 0,0; whether it succeeds or fails, it’ll have no effect and tell you the data there); it’s just slower than doing two separate reads, and it’s not difficult to make algorithms cope with such tearing. that’s just a footnote. it is also true that split locks are technically a thing that exists on x86 but that is just another footnote
are you getting this info from chatgpt or something
My understanding has always been that the lock modifier pins the cache line, and that it’s done to make up for the lack of 128-bit loads. I never said the instruction itself wasn’t atomic, just that you couldn’t atomically load it without the cache-line lock. Perhaps that’s not correct, because I do lose interest fast at the hardware boundary.
I will say, 3-4 years ago I was mucking around w/ generated ASM, especially because MUSL did not (and I believe still does not) have an implementation of atomic_compare_exchange_*() for 128-bit operands. Definitely, removing the lock led to (very) occasional issues on hardware that was only a year or two old at the time.
Also, I just wasted a bit of time googling for my sanity, and didn’t find anything to contradict my understanding. In fact, I did find this page, which is not authoritative, for sure, but aligns with my understanding.
But that article is a couple years old, so not as recent as I remembered (probably Meteor Lake?). I’ll chalk that one up to my advancing age messing with my perception of time.
Definitely feel free to point me to something that definitively shows that I’m wrong. I’ve definitely been guilty of carrying around conventional wisdom past its expiration date, and am happy to update my understanding.
But chatgpt? Really? LOL.
And just to be clear, my understanding of the semantics for plain old cmpxchg was also that it locks the cache line, if you were bringing forward your ancient code with 32-bit loads. I don’t think it’s just boilerplate; more that Intel’s always been very committed to backward compatibility, possibly to their detriment.
ok, i guess the salient point is just this one:
it is true that an algorithm written for x86 using 128-bit cas will often start by doing a pair of 64-bit reads … it’s not difficult to make algorithms cope with such tearing. that’s just a footnote
ibraheem website saying ‘while 128-bit atomics were possible, they weren’t very practical’ is not really true. for example, if you use counted pointers, a torn read will just cause you to lose your next cas, and that cas will give you an atomic read for free; that will only happen under contention, in which case you could have expected to lose your cas anyway. or for plain reads, use the counter as a seqlock. using vmovdqa to get an atomic 128-bit load is not necessarily particularly helpful because it slows down the uncontended case. it’s mostly of interest to stdlib/language implementors who want to provide a uniform suite of read/write/cas/etc. iirc gcc would generate cmpxchg16b for the builtin atomics if you passed it -mcx16 though
details of how atomic rmw works vary by uarch. fabian giesen hinted that intel uses an ll/sc-like construct internally for all of them