Memory is slow, Disk is fast
20 points by runxiyu
tl;dr: loading data through mmap may be slower than streaming directly from a well-designed async I/O pipeline
The choice of mmap vs io_uring is a red herring, because the two pieces of code are very different. The io_uring approach uses 6 parallel worker threads, while the mmap code faults in memory serially from a single thread. The only similarity is that they both count the 10s (newline bytes) in a single thread.
This is more "mmap is slow" than "memory is slow" imo. That maybe shouldn't be too surprising if you think of a loop over newly mmapped data as going into the kernel every 4KB. You don't need anything super fancy to beat it; just using read(2) is faster in many cases and many tools switched from mmap to read for performance before NVMe and io_uring were ever thought of.
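For concreteness, here is a minimal sketch of the plain read(2) approach described above: stream the file through one reusable buffer and count newline bytes. The 1 MiB buffer size and the CLI handling are arbitrary choices for illustration, not from the article.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[1 << 20];   /* 1 MiB scratch buffer, refilled each pass */
        uint64_t newlines = 0;
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] == '\n')
                    newlines++;

        printf("%llu\n", (unsigned long long)newlines);
        close(fd);
        return 0;
    }

Each read(2) is one syscall per megabyte rather than one page fault per 4 KB, which is where most of the win over naive mmap comes from.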
But nobody would have clicked on a link with that title. Therefore the title had to be misleading (on purpose, obviously, since we had the same conversation about "part 1" of this series).
I will say it: I would have clicked on a "mmap is slower than just reading in this case"-style headline. I love headlines that have that sort of texture. More of those please!
The article's benchmark unfortunately has some problems, discussed here. The mmap code just faults in the entire range serially, whereas the io_uring code uses multiple worker threads. Even if the counting is done on a single thread, the I/O portion of the code is done differently (parallel vs serial), which distorts the final results.
Wouldn't Huge Pages help?
It should, because it would effectively be doing readahead, but madvise(MADV_SEQUENTIAL) should theoretically get you readahead with the default page size too. I haven't tried verifying this, but this LWN article suggests it might work.
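A sketch of that hint, assuming the usual open()/fstat()/mmap() setup (error handling kept minimal; the function name is made up for illustration):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a file read-only and tell the kernel we will scan it in order,
     * so it can read ahead aggressively and drop pages behind us. */
    static uint64_t count_newlines_mmap(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 0; }
        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 0; }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 0; }

        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* readahead hint */
        /* madvise(p, st.st_size, MADV_HUGEPAGE);    Linux THP hint,
         * re the hugepage question above */

        uint64_t n = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (p[i] == '\n') n++;

        munmap(p, st.st_size);
        close(fd);
        return n;
    }

MADV_HUGEPAGE is the transparent-hugepage counterpart; whether it applies to file-backed mappings depends on the kernel and filesystem.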
So, how fast could I do mmap with two threads? Say I know my page size and have thread A touch a byte on the next page while thread B is busy dealing with data in the current page. Does the manipulation in the kernel heavily affect all threads, or just the one causing the fault? I think it only affects the thread which caused the fault, but I admit I couldn't be certain of that.
The Flash web server (1999) used helper threads to page in mmap()ed files. I expect the trick still works, at least for offloading I/O latency. I guess the main thread will be oblivious to page table shenanigans until it encounters a TLB miss. It would be nice if the core's prefetcher could populate the TLB in the background before the real data load instructions run into the page boundary … though some searchengineering suggests hardware prefetching might not be eager enough, and explicit software prefetch is required to populate the TLB.
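A sketch of that helper-thread trick under some stated assumptions (4 KiB pages, a fixed look-ahead distance; the names are made up for illustration): one thread reads a byte per page some distance ahead of the consumer, so the faults, and the disk waits behind them, happen off the hot thread.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stddef.h>

    #define PAGE  4096
    #define AHEAD (256 * PAGE)          /* stay ~1 MiB ahead of the consumer */

    struct prefetch {
        volatile const char *base;      /* the mmapped file */
        size_t len;
        _Atomic size_t cursor;          /* consumer advances this as it scans */
    };

    static void *toucher(void *arg)
    {
        struct prefetch *pf = arg;
        size_t touched = 0;
        while (touched < pf->len) {
            size_t cur = atomic_load(&pf->cursor);
            /* Fault in pages up to AHEAD bytes past the consumer. */
            while (touched < pf->len && touched < cur + AHEAD) {
                (void)pf->base[touched];   /* the read itself takes the fault */
                touched += PAGE;
            }
            sched_yield();
        }
        return NULL;
    }

The consumer just bumps pf->cursor as it scans. Since both threads share the address space, the helper's fault installs a valid page-table entry, so by the time the consumer crosses the page boundary its TLB miss is just a hardware page walk, not a fault.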
The MAP_POPULATE flag will ask the kernel to pre-fault the mapping to reduce faults during the scan, and/or you can call madvise with MADV_SEQUENTIAL to get the kernel to prefetch pages for you.
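A sketch of both knobs together (MAP_POPULATE is Linux-specific; fd and len are assumed to come from open()/fstat() as usual):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    static char *map_prefaulted(int fd, size_t len)
    {
        /* MAP_POPULATE blocks in mmap() until the pages are resident,
         * trading startup latency for a fault-free scan afterwards. */
        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED)
            return NULL;
        /* Or, instead of paying everything up front, just hint readahead: */
        madvise(p, len, MADV_SEQUENTIAL);
        return p;
    }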
But as others have said, it’s probably faster to use read(). Allocate two buffers and use two threads to make a producer/consumer flow.
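A minimal sketch of that two-buffer producer/consumer flow; the buffer size, the condition-variable rendezvous, and the newline-counting payload are all illustrative choices:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUFSZ (1 << 20)

    static char bufs[2][BUFSZ];
    static ssize_t fill[2];            /* bytes in each buffer; <= 0 means EOF */
    static int ready[2];               /* 1 = full, waiting for the consumer */
    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        int fd = *(int *)arg;
        for (int i = 0;; i ^= 1) {
            pthread_mutex_lock(&mu);
            while (ready[i]) pthread_cond_wait(&cv, &mu);   /* wait until drained */
            pthread_mutex_unlock(&mu);

            fill[i] = read(fd, bufs[i], BUFSZ);             /* I/O off the hot path */

            pthread_mutex_lock(&mu);
            ready[i] = 1;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mu);
            if (fill[i] <= 0) return NULL;                  /* EOF or error */
        }
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        pthread_t t;
        pthread_create(&t, NULL, producer, &fd);

        uint64_t newlines = 0;
        for (int i = 0;; i ^= 1) {
            pthread_mutex_lock(&mu);
            while (!ready[i]) pthread_cond_wait(&cv, &mu);  /* wait until filled */
            pthread_mutex_unlock(&mu);

            if (fill[i] <= 0) break;
            for (ssize_t j = 0; j < fill[i]; j++)
                if (bufs[i][j] == '\n') newlines++;

            pthread_mutex_lock(&mu);
            ready[i] = 0;                                   /* hand buffer back */
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&mu);
        }

        pthread_join(t, NULL);
        printf("%llu\n", (unsigned long long)newlines);
        return 0;
    }

While one buffer is being counted, the other is being filled, so the disk and the CPU overlap instead of taking turns.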