Why are we worried about memory access semantics? Full barriers should be enough for anybody

39 points by abareplace

wucke13

I'd just wish that there would be a way to do possibly tearing, non-atomic bytewise read/write without atomics and without UB and without inline assembly in Rust. Basically ptr::read_volatile without the external device declaration allowing this only for memory outside allocations.

lina

Atomic relaxed byte read/write are just a normal load/store on any reasonable architecture. You can already do that, if you control all users. No UB there.

The problem comes when one side is untrusted, not Rust code, there's word size mixing or unaligned loads/stores, etc. On paper, there's no safe way to share memory with an untrusted (or uncontrolled) user in Rust and guarantee no UB. In practice, when I ran into this, the consensus was "yeah, it's UB on paper but in practice just use atomics, it works on real implementations and architectures". This is essentially harmless UB because there's no way for the Rust side to observe the difference between the unknown user using atomics and not on real architectures, therefore the Rust code can't observe the existence of UB, therefore it's not really UB (just a much more narrowly scoped "you get arbitrary tearing").

Of course, this mostly applies when both users aren't Rust in the same process/binary. If they are (like the mixing atomic word sizes scenario), then the UB dragons come back if the compiler can ever prove that both sides are accessing the same memory.
- wucke13
  
  Thank you for the assessment!
  
  Atomic relaxed byte read/write are just a normal load/store on any reasonable architecture. You can already do that, if you control all users. No UB there.
  
  So in practice, it likely suffices to say: The LLVM codegen on the following targets seems to do the right thing here despite it being UB as far as the compiler manual is concerned. I don't like this approach, but it is what we are currently (kind of implicitly) doing.
  
  The problem comes when one side is untrusted, not Rust code, there's word size mixing or unaligned loads/stores, etc
  
  This goes beyond my ask though. I assume that the reads/writes are byte based, therefore without alignment requirements. Also as I read the data into [u8; N] of some compile-time known size N, there is no illegal bit pattern; literally any of the 2<superscript>N</superscript> possible values that could end up in the byte slice is fine for me in my use-case. I just want tearing reads and writes on byteslices without use of atomics and I'm fine for any possible data in them after the data race. Just, please, rustc, please don't corrupt something else (as it unfortunately is allowed once UB is present).
  
  Of course, this mostly applies when both users aren't Rust in the same process/binary. If they are (like the mixing atomic word sizes scenario), then the UB dragons come back if the compiler can ever prove that both sides are accessing the same memory.
  
  Well, for my use case both ends are the same Rust codebase. The "if the compiler can ever prove" part irks me too; in my specific use-case it can never prove such a thing: its undecidable. The compiler doesn't know the bytecode-to-be-interpreted when generating the object code, and thus it can never decisively determine whether this will or won't happen. The same holds true when using a libc/syscall based synchronization primitive, the compiler can't know its semantics.
  
  Last but not least, unfortunately Wasm does AFAICT not forbid unaligned atomic access (tough it does explicit reserve the right for severely degrade performance upon non-aligned atomic operations).
  - orib
    
    This goes beyond my ask though. I assume that the reads/writes are byte based, therefore without alignment requirements.
    
    I don't believe this is guaranteed to work, even on x86. The Intel manual, volume 3A, section 9.1.1, says in part:
    
    9.1.1 Guaranteed Atomic Operations <snip> The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line. Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AV
    
    If you span a cache line, tearing is allowed unless you use the lock prefix.
    
    https://cdrdv2-public.intel.com/812386/253668-sdm-vol-3a.pdf
    
    lina
    
    Parent is simply asking for 8-bit loads and stores to work, and obviously then the overall data load across bytes then can tear, and parent is okay with that.
    
    orib
    
    Whoops, I misread.
    
    lina
    
    So in practice, it likely suffices to say: The LLVM codegen on the following targets seems to do the right thing here despite it being UB as far as the compiler manual is concerned. I don't like this approach, but it is what we are currently (kind of implicitly) doing.
    
    It isn't UB! Just declare your shared data as an array of AtomicU8 and read/write it with Relaxed ordering. Since you're using atomics, it's not UB, and since the ordering is relaxed, the actual codegen will be just normal load/stores anyway. The reason this isn't UB and sharing the non-atomic pointer is, is that the compiler isn't allowed to make extra assumptions in the atomic case (such as that the data doesn't change or that it can merge multiple loads/stores together). Essentially, AtomicU8 is just turning off compiler optimizations for the loads/stores here, and that is sufficient for it not to be UB.
    
    In other words, you're saying you don't want atomics, and my answer is you do want atomics because in practice it's literally just a "turn off UB" switch for your use case, nothing more.
    
    Where this gets tricky is weirder use cases I mentioned, but they don't apply to you (edit: they might, see my reply below).
    
    wucke13
    
    Thank you again for elaborating! So what I deduce from this is:
    
    for non-atomic instructions: atomic load(s)/store(s) on AtomicU8s, guarded by a global ReadLock on the linear memory (i.e. with tearing reads & writes)
    
    for atomic read instructions: the same as above
    
    for atomic write instructions: the same as above (but guarded by a global WriteLock on the linear memory)
    
    We already have these global RwLock on each linear memory (to avoid UB with conflicting reads/write via the pointer API while another thread causes a reallocation of the Vector), so we can just use that one to guard atomic writes too. Note the semantics of a ReadLock in this case actually allow reading and non-atomic writing, and the WriteLock allows reallocation of the backing memory for the grow instruction and (according to this posting) also atomic writing.
    
    I would prefer a protocol without locks for any kind of read/write (except for the reallocation case which will always require a lock), but this will do at least without UB. We will have to benchmark to see if the additional cost of keeping fine grained locks for the atomic writes outweighs the performance penalty of lock contention for a global lock.
    
    lina
    
    It's actually potentially simpler than that. You don't need an RwLock. You just need a simple mutex.
    
    Non-atomic loads/stores: just plain AtomicU8 R/W, with tearing for larger types
    
    u8 atomic loads/stores: same
    
    u16 and larger atomic loads/stores: same, but protected by a global Mutex (or RwLock if you want to split loads/stores even more).
    
    It is permissible for anything that isn't a wide atomic to tear/conflict with it. The only possible caveat here is that you should check what the wasm memory model requirements are. It's possible that it requires non-tearing non-atomics (like real CPUs support), and if so, you'd have to extend the mutex to all reads/writes and it becomes a lot heavier...
    
    However, if you already have a RwLock, then that's fine too (taking the write side is equivalent to a simple murex as far as this design is concerned). You could also just take the read side as you already do anyway, and then use a separate nested mutex (or multiple indexed by a hash of address as I said) independent of your reallocation RwLock logic.
    
    Edit: I should mention, the above says treat atomic/non-atomic the same, but actually load/store access ordering matters. Non-atomic instructions map to atomic Relaxed, but atomic instructions need to use the ordering appropriate for the instruction or whatever the wasm spec says. If in doubt, SeqCst.
    
    quad
    
    Why?
    
    wucke13
    
    To implement a WebAssembly interpreter :)
    
    The issue we have: The Wasm execution model now features shared (linear-) memory. Accesses to this memory may be atomic. Because there are no strict alignment requirements by WebAssembly itself, all operations operate on bytes. I.e. to get an i32 from the linear memory at a given index, we just copy (think std::ptr::read_volatile on a subslice of an allocation of UnsafeCell<u8>) the 4 bytes starting at that index into a temporary, stack-allocated buf: [u8; 4], and call i32::from_le_bytes(buf). The resulting i32 is then used subsequently in the interpreter loop, e.g. as operand to some instructions. However, reading the Rust documentation, this is UB in multiple ways.
    
    Even with {read,write}_volatile, data races are UB. The only non-UB way (according to the docs, i.e. the official promise on the compilers' behavior): all access must be atomic. The fact that we have volatile reads & writes changes nothing if the memory in question is part of a Rust managed allocation, still UB. For non-allocation memory, i.e. memory mapped devices, there is an exemption, where using volatile pointer operations is not instant-UB. But, in our case the UnsafeCell<u8> is part of a Vec, so allocated, so its UB.
    
    Coming to the WebAssembly interpreter, the interpreter itself just interprets the instructions of the bytecode. If the bytecode commands a normal write to one address in the shared linear memory, while also commanding an atomic read to the same address from another thread, that is instant UB again! All racy read + write access to the same data must be atomic to avoid UB.
    
    Now the problem that I face: The WebAssembly instruction set allows for the guest program to decide whether to use atomic or non-atomic access. I want to respect that choice in my interpreter (instead of introducing some form of locking, or using atomic accesses always even if they are not required by the instruction currently interpreted). But, there is no way to do that in Rust (excluding inline assembly) that is not UB.
    
    As lina points out; likely this UB is fine. In x86 i.e. everything is effectively atomic with relaxed ordering. However, it irks me that I can not just write sound (as in free of UB) code that does this. I started a discussion in the Rust forum a while back, where there is some more detail and insight on the matter: https://users.rust-lang.org/t/is-there-a-sound-way-to-call-ptr-copy-nonoverlapping-with-count-1-and-dst-being-an-element-of-vec-unsafecell-u8/133382
    
    lina
    
    As I said, you can just use atomic reads/writes all the time and that is fine performance-wise in practice. However, I see a caveat. If you allow the bytecode to request atomics itself, then it can presumably ask for a wider-than-u8 atomic access to the data. That brings you back to UB territory, although in practice it will probably be fine (and this is much less likely to actually wake up the UB dragons than just using non atomics for the normal loads/stores).
    
    One solution to this issue to eliminate UB entirely again, if you don't mind the performance impact, is to emulate the wider-than-u8 bytecode atomics: just turn them into regular torn reads of u8 atomics, and wrap the access in a global lock for bytecode atomics. The lock can be a single global lock or, if you expect contention to be a problem, you can create a table of locks indexed by a hash of the (virtual/bytecode) address.
    
    In other words, your non-atomic and atomic u8 accesses map to Rust atomics, and your atomic greater-than-u8 accesses are emulated on top.
    
    wucke13
    
    Dully noted!
    
    That brings you back to UB territory, although in practice it will probably be fine (and this is much less likely to actually wake up the UB dragons than just using non atomics for the normal loads/stores).
    
    Do you have any kind of source for this that I can read into a bit? In particular, may this be purely decided by the semantics of the target CPU (micro-)arch, or does it depend on implementation dependent behavior within the LLVM code gen?
    
    lina
    
    It's just a compiler issue. The CPU does not care (on current modern CPUs), you just don't get non-tearing guarantees if you mix access sizes (and other subtle effects on the memory model possibly). The UB is just due to the Rust atomics model (which inherits from the C++ one) which is documented here:
    
    https://doc.rust-lang.org/std/sync/atomic/
    
    The other possible cause of undefined behavior in the memory model are mixed-size accesses: Rust inherits the C++ limitation that non-synchronized conflicting atomic accesses may not partially overlap. In other words, every pair of non-synchronized atomic accesses must be either disjoint, access the exact same memory (including using the same access size), or both be reads.
    
    This is why, in practice, if the compiler cannot prove/observe that this will ever happen (and if you're interpreting wasm, it probably can't, since whether this happens is controlled by the interpreted code at runtime), you'll be fine despite it being UB, because the underlying CPU memory model is completely fine with this (it just provides fewer guarantees, but no UB) and the fact you're using atomics already dispelled all the general race-related UB dragons that could show up.
    
    To give an example: If you do an AtomicU32 read of P, an AtomicU16 write of Q, and an AtomicU32 read of P in sequence, the compiler is allowed to assume the former didn't affect the latter and optimize accordingly (specifically, with Relaxed accesses, it could assume both reads of P return the same value). However, if you write an interpreter interpreting that code sequence, will the compiler of the interpreter be able to usefully apply any optimization to the interpreter loop that would be able to make that assumption and trigger unexpected behavior when presented with that code sequence at runtime? In theory yes (because the spec says it can and you could certainly construct an adversarial compiler that does it just to break your code on purpose), in practice, probably not.
    
    pervognsen
    
    I assume they are referring to the kind of memcpy you need for seqlocks. The C++ proposal calls it byte-wise atomic memcpy. There's a similar RFC for Rust and crates like atomic-memcpy.
    
    muvlon
    
    Seqlocks need something even weaker than that, they just need UB-freedom on the actual loads. The values will never actuslly be used if there was a racing write.
    
    LLVM already provides these semantics on normal unatomic accesses: in IR semantics, data races are not directly UB but yield a "poison" value that causes UB on most uses.
    
    wucke13
    
    Indeed not, but I talked to some of the devs behind iceoryx v2 Rust library, and they are concerned about similar challenges in the implementation of lock-free code with shared memory. IIRC, their main issues however revolved around the preservation of pointer-provenance though.
    
    majaha
    
    Your CPU that has grown to have caches the size of a small country? Yeah, those caches need to be flushed because the memory now has to be made visible to another processor, and they need to be invalidated because they may have been changed by another processor.
    
    This is a common misconception, but memory barriers are about store buffers and reordering loads etc, not flushing cache*. Your cache hierarchy is already consistent by default due to the MESI protocol.
    
    *Unless you consider a store buffer with store-to-load forwarding as an unusual kind of cache.
    
    david_chisnall
    
    Modern cache coherency protocols are a lot more complex than MESI. As I recall, AMD moved to MOESI with the first Opterons and cache coherency protocols didn’t stop getting more complex then. A lot of multicore systems support direct access to another near core’s L1. Remote stores have the biggest impact of a weaker memory model on the cache coherency fabric. You can write to another core’s L1 without pulling the line out of exclusive state. In the absence of a barrier, it’s up to their L1 controller logic to resolve the order of the writes and that can be arbitrary. On a TSO architecture, you can still do that but the bus message has to encode more to allow the ordering to be resolved deterministically and some other things may be blocked in the store queue until the acknowledgment is received.
    
    fanf
    
    Modern cache coherency protocols are a lot more complex than MESI.
    
    For example, ARM AMBA has facilities for running an atomic op in the remote core that has an exclusive cache line, so it doesn’t have to pingpong. Dunno if any x86 cores can do something similar.
    
    https://developer.arm.com/documentation/102714/0100/Atomic-fundamentals
    
    pervognsen
    
    PCIe 3.0 apparently supports remote atomics. Completely separate from that, GPUs used remote atomics for device memory across cores in the days when that kind of memory wasn't cached, presumably by implementing the atomic operations in the memory controller. Now that GPUs have coherent caches for that kind of memory, I'm not sure if remote atomics are still the standard implementation method.
    
    olliej
    
    I think that the most direct way we have to measure the cost of TSO is the arm Macs - they can run processes with or without TSO. I vaguely recall someone doing the direct comparison back when the M1 Macs were first released but I don’t know if the performance characteristics have changed since then.
    
    david_chisnall
    
    The overhead was around 7%. This is probably as close to a fair comparison as possible but there are a few sources of noise:
    
    If you’re running the same code in both modes, it will contain barriers and load-acquire/store-release instructions. This provides two sources of overhead in the TSO mode. First, decoder is going to discard the barriers, but they still consume decoder (and I-cache) bandwidth. This may slightly reduce ILP. Second, the load-acquire and store-release instructions have much simpler addressing modes, so you’re doing more arithmetic instructions than necessary. Apple’s cores do some neat tricks for horizontal forwarding but this is still reducing ILP slightly.
    
    Second, it’s not clear how Apple’s engineering team balanced the requirements to run native and emulated code quickly. As I understand their designs, they decode all loads and stores into something that expresses both the barrier semantics and the addressing modes, so effectively all loads become load-acquire and all stores become store-release in TSO mode. This is quite a simple thing to do, but if TSO is the primary goal you can always trade power and area for better performance in various ways. I don’t know if they did any of these.
    
    tobin_baker
    
    I guess the original philosophy of the C++11 memory model was “seq_cst oughta be enough for anybody”, so this attitude isn’t far from what the standards committee was thinking circa 2010 :)
    
    davmac
    
    That the C++11 memory model includes weaker ordering modes (than seq_cst) is pretty much the whole point of the article. It's not "ought to be enough" as in "should be strong enough", it's as in "should be performant enough", and it's in jest: the point is that it isn't.
    
    david_chisnall
    
    Indeed. The philosophy was that you should be able to correctly implement any concurrent algorithm or data structure using sequential consistency and then weaken the ordering if required for performance. This remains a good rule of thumb: implement with sequential consistency, benchmark on a weakly-ordered architecture, and move things to acquire-release semantics if doing so improves performance and you can prove it preserves all required orderings. Quite often, the atomic bit is so infrequent that even doubling the performance has no impact on overall throughput of the system, so leaving as sequentially consistent with a comment saying ‘this is stronger ordering than is required, if this is a bottleneck later you can optimise here’ is the right strategy. Sometimes, the atomic bit is the bottleneck and a small improvement can give big gains.
    
    One of the problems with this approach is that the implicit barriers of the sequentially consistent operation may make something else nearby correct. We had a FreeBSD bug from this a while ago, where the acquire and release operations were originally stronger than they needed to be (in particular, they were full barriers for the compiler) and relaxing them exposed some bugs from code that had accidentally relied on another function enforcing ordering.