go may require prefaulting mmap
18 points by eBPF
The naive approach would be to walk the memory in another goroutine, to prime it. But that’s exactly the problem.
Erm. I think the naive approach would be to read the manual. And if not, it should be.
POSIX tells us to use posix_madvise() with POSIX_MADV_SEQUENTIAL. Linux also supports MAP_POPULATE which lets you do it at mmap-time. I recommend this.
Your write-trick works because the kernel copies the user pages into memory (which triggers the page fault bringing them into the page cache). I don’t think it is guaranteed to do this, because the kernel can absolutely recognise /dev/null (and even if Linux currently doesn’t, it could, and other unixen can), and it’s going to be slower than just calling madvise.
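For readers who want to see what that suggestion looks like from Go, here is a minimal, Linux-only sketch using the standard `syscall` package. The helper names `mapPopulated` and `run` are made up for illustration, and the temp file stands in for whatever file you would really map. It shows the two calls mentioned above and does not address how the Go scheduler behaves while `mmap` is busy populating pages.

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapPopulated (hypothetical helper) maps size bytes of f read-only and
// asks the kernel to prefault the pages during the mmap call itself.
// MAP_POPULATE is Linux-specific.
func mapPopulated(f *os.File, size int) ([]byte, error) {
	return syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ, syscall.MAP_SHARED|syscall.MAP_POPULATE)
}

// run (hypothetical helper) writes a small temp file, maps it with
// MAP_POPULATE, applies the sequential-access advice, and returns the
// mapped contents as a string.
func run() (string, error) {
	f, err := os.CreateTemp("", "populate-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	const msg = "hello, page cache"
	if _, err := f.WriteString(msg); err != nil {
		return "", err
	}

	data, err := mapPopulated(f, len(msg))
	if err != nil {
		return "", err
	}
	defer syscall.Munmap(data)

	// The madvise route: a hint only, the kernel is free to ignore it.
	if err := syscall.Madvise(data, syscall.MADV_SEQUENTIAL); err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	s, err := run()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println(s)
}
```

In a real program you would likely pick one of the two mechanisms rather than both; they are combined here only to keep the sketch short.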
POSIX tells us to use posix_madvise() with POSIX_MADV_SEQUENTIAL. Linux also supports MAP_POPULATE which lets you do it at mmap-time. I recommend this.
I would assume MAP_POPULATE will cause the mmap call to take a very long time (relatively), which will cause the same issue as direct memory access: because the runtime is not aware of that behavioural change (AFAIK go has no awareness of MAP_POPULATE whatsoever) it’s going to “assume” normal mmap behaviour, and the long block is going to lock up the runtime. Which is the very issue it’s trying to avoid.
To my knowledge there is no “good” way to handle this case. Go does have an internal method for blocking syscalls which run long enough to need to be preempted (entersyscallblock, which I believe moves the call to the system stack and, importantly, off of the evented schedulers), but there is no public API for this. So you’d have to move the entire mmap syscall to a C function. Which I guess is no worse than using C to fault every page, but…
Also, tedu’s a long-time OpenBSD contributor; I don’t know how much he uses Linux, if at all.
I would assume MAP_POPULATE will cause the mmap call to take a very long time
Mapping the pages is going to take “a long time”, whether you are waiting for the kernel to do it on the other side of write() or you are waiting for mmap() to do it, but mmap() will do it faster.
I think the issue is that someone wants to do “something” (like run another goroutine) while those page faults happen…
Go does have an internal method for blocking syscalls which run long enough to need to be preempted
… because it’s not usually the system call (mmap) that blocks, but the page fault when accessing memory. go (or your program) can’t do anything about that exactly, but madvise can make sure this is as short as possible, and…
To my knowledge there is no “good” way to handle this case,
…If madvise isn’t good enough, and you really want something going on while those pages are coming in, and you really don’t want to read() in a loop like you have better things to do, you can do this:
mmap() space to get an address, then wait for SIGSEGV in that range, finally just load the page you want into the address you know it needs to be in.
This works for go, java, rust, perl, C, and so on. And I hate it. It’s slower (total throughput) than everything else, but it has the best worst-case latency of anything reasonably portable.
Just remember if you call go’s read from inside the signal handler that you unmask SIGSEGV and put the stack back first.
If you prefault mmap… Why not just use read(2)?
Your question is answered in the article.
In general, mmap can be faster because the kernel doesn’t need to copy any data into your memory, it just appears there.
Just take a peek at every 4096th byte to get the pages into memory. […] probably faster than a system call.
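A minimal sketch of that page-touching loop in Go, for illustration. The function name `touchEvery` is hypothetical, and the sketch assumes the slice starts on a page boundary (true for an mmap'd region, not for an arbitrary slice). Note that each load here can still block the calling thread in a page fault without the Go runtime knowing, which is the concern raised elsewhere in this thread.

```go
package main

import (
	"fmt"
	"os"
)

// touchEvery (hypothetical helper) reads one byte every stride bytes of b
// so that each page is faulted in. It returns the number of bytes touched
// and their sum; the sum exists only so the loads are not optimised away.
// Assumes b starts on a page boundary and stride is the page size.
func touchEvery(b []byte, stride int) (touched int, sum byte) {
	for i := 0; i < len(b); i += stride {
		sum += b[i]
		touched++
	}
	return touched, sum
}

func main() {
	// Ask the OS for the page size rather than hard-coding 4096.
	pageSize := os.Getpagesize()

	// A plain heap buffer stands in for the mmap'd region here.
	buf := make([]byte, 10*pageSize+1)
	touched, _ := touchEvery(buf, pageSize)
	fmt.Println("pages touched:", touched)
}
```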
Prefaulting is faster than read(2) when the page cache is already hot.
sendfile() should be a simpler solution to this problem.
The Flash web server paper that @tedu linked to near the end is a classic examination of how fast a server can go without sendfile(). I read it over 25 years ago, dunno how well it has aged!
On very high performance servers such as Netflix cache appliances memory bandwidth is the limiting factor (and bus bandwidth in general) so it’s vital to avoid copying.
sendfile() should be a simpler solution to this problem.
Nobody has ported sendfile(2) to OpenBSD yet, alas
Where’s the code that this article is referencing?
Appreciate the link, thank you. Given this code has no meaningful documentation, tests, or benchmarks, I wonder why it keeps showing up on the lobste.rs frontpage…
The code does not. Tedu’s various essays tend to, because they’re usually about interesting tidbits and corner cases.