mmap in Go considered harmful
13 points by runxiyu
Note that this blog post is from 2018, which is before changes to Go's preemption strategy. Check out this 2020 GopherCon talk about how Go handles preemption these days: https://www.youtube.com/watch?v=1I1WmeSjRSw
Oh, apparently Go may require prefaulting mmap; this is similar.
vmsplice() on Linux could be interesting here, in both a good and a bad way. If you have a pipe another process is going to read from, you can essentially move the page faults to the reader. That could also be used to move the page faults to a read() by writing/vmsplicing one byte from each page (IOV_MAX is 1024, so this should be only two syscalls per 4MB of pages). Once that read completes you know the pages are faulted in, and since you've done a read(), Go will schedule this correctly.
Obviously this is Linux specific, but it would be interesting to know how that idea performs compared to Tedu's idea of write() to /dev/null, or to using posix_madvise (although I don't know how you tell that posix_madvise has finished paging things in, so it's a bit different).
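Here's a rough Go sketch of that vmsplice() idea, assuming golang.org/x/sys/unix on Linux; prefaultPages is a hypothetical helper, the batching is simplified, and whether the faults actually land in the read() rather than in vmsplice() itself is the claim made above, not something this sketch verifies:

```go
//go:build linux

// Sketch of the vmsplice() trick: splice one byte from each page of an
// mmap'd buffer into a pipe, then read() those bytes back, so any page
// faults are taken inside an ordinary read() syscall that the Go runtime
// already knows how to schedule around.
package mmaputil

import (
	"os"

	"golang.org/x/sys/unix"
)

var pageSize = os.Getpagesize()

const iovMax = 1024 // IOV_MAX on Linux

func prefaultPages(buf []byte) error {
	var p [2]int
	if err := unix.Pipe(p[:]); err != nil {
		return err
	}
	defer unix.Close(p[0])
	defer unix.Close(p[1])

	scratch := make([]byte, iovMax) // sink buffer for the read side
	pages := (len(buf) + pageSize - 1) / pageSize

	for page := 0; page < pages; {
		// Build up to IOV_MAX iovecs, one byte from each remaining page.
		var iovs []unix.Iovec
		for i := page; i < pages && len(iovs) < iovMax; i++ {
			iov := unix.Iovec{Base: &buf[i*pageSize]}
			iov.SetLen(1)
			iovs = append(iovs, iov)
		}
		// vmsplice may transfer fewer bytes than requested if the pipe
		// fills up; each byte transferred corresponds to one page.
		n, err := unix.Vmsplice(p[1], iovs, 0)
		if err != nil {
			return err
		}
		// Drain the pipe with read(); per the comment above, this is
		// where the page faults should land.
		for got := 0; got < n; {
			m, err := unix.Read(p[0], scratch[:n-got])
			if err != nil {
				return err
			}
			got += m
		}
		page += n
	}
	return nil
}
```

The loop tolerates partial transfers, since a default-sized pipe may accept far fewer than IOV_MAX one-byte segments per call, so in practice it can take more than two syscalls per 4MB.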
So, crazily enough, I've been thinking about this exact problem for over a decade, ever since I thought "hey, a page fault on swap would probably really mess up the Go scheduler, because it's really going to block the underlying thread (an M, in Go terminology, if I remember correctly)".
Thanks to Valyala for writing this and sharing!
That's a different problem, with page evictions and faults needing to update the TLBs for all CPUs. There's a paper that goes around every now and then about it called "Are You Sure You Want to Use MMAP in Your Database Management System?" at https://db.cs.cmu.edu/mmap-cidr2022/
I think the Go scheduler blocking is a more severe problem?
I think Varnish got designed that way because it was written back when:
I think Varnish was written for FreeBSD first, and I vaguely remember seeing grumblings by PHK (Varnish's author) to the effect that FreeBSD's virtual memory system was better at the time at doing this fast and scalably, but who knows what's changed since then. Also, I might have misremembered.
Howard Chu, author of LMDB, wrote a response to that paper; he was righteously vexed that LMDB is a counterexample to many of the paper’s assertions… https://lobste.rs/s/n40bdi/are_you_sure_you_want_use_mmap_your_dbms
(edited to add) The RavenDB commentary is also worth a read https://lobste.rs/s/ltrw2p/re_are_you_sure_you_want_use_mmap_your
Oh, I see, this is a bad title; it is about MMIO. From there it sounds like Go uses virtual threads/fibers/whatever they're called this decade, so because the IO page faults can happen at any point, your working "thread" gets blocked, but more work cannot then be scheduled because there's only one real thread. This is a standard problem with virtual threads, not Go specific (see also virtual threads breaking forward-progress guarantees that depend on tasks being guaranteed to get CPU time).
There is an open proposal (#68769) to expose Prefetch to user space. It's not exactly a solution, but it's... something?
That’s an intrinsic corresponding to a prefetch hint instruction, which just moves data closer to the CPU in the cache hierarchy. It doesn’t help with missing pages in mmap() because a prefetch hint instruction doesn’t trigger a page fault exception, so it doesn’t tell the kernel that the program wants to use the missing data.
The kind of prefetching needed here is IO or virtual-memory page prefetching (not CPU cache prefetching). To trigger an IO prefetch, the program needs to trigger real page faults or make corresponding read() syscalls. If it wants to avoid the latency hit, it needs to do that on a separate worker thread; Go will do that automatically for read(), but probably not for page faults. See also https://lobste.rs/c/diojhk
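A minimal Go sketch of that read()-based prefetch, assuming the mmap'd region comes from a file you can also open normally; warmPageCache and the chunk size are hypothetical, not anything from the post:

```go
// Before touching an mmap'd region, read the same byte range through an
// ordinary *os.File so the pages land in the kernel page cache. Because
// these are plain read() syscalls, the Go runtime parks the blocked thread
// and keeps scheduling other goroutines in the meantime.
package prefetch

import (
	"io"
	"os"
)

func warmPageCache(path string, off, length int64) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 1<<20) // warm the range in 1 MiB chunks
	for length > 0 {
		chunk := int64(len(buf))
		if length < chunk {
			chunk = length
		}
		n, err := f.ReadAt(buf[:chunk], off)
		if err == io.EOF {
			return nil // reached end of file; nothing more to warm
		}
		if err != nil {
			return err
		}
		off += int64(n)
		length -= int64(n)
	}
	return nil
}
```

Usage would be something like running `go warmPageCache("data.bin", 0, int64(len(mmapped)))` (hypothetical file name) on its own goroutine and waiting for it before scanning the mapping, so the later mmap accesses mostly hit pages already resident in the page cache.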