Think you can’t interpose static binaries with LD_PRELOAD? Think again

24 points by gioele

Screwtape

Rather than emulating the entire CPU with QEMU so you can trap syscalls, why not use ptrace(2) so any CPU-intensive portions of the program still execute at native speed? Better yet, you won't need to wait for a patch to be pushed upstream.

YogurtGuy

High-performance syscall trapping on Linux is hard (because the syscall ABI is just the syscall instruction). ptrace works great! And it's even better when combined with seccomp-bpf, since that makes it easier to selectively intercept system calls.

The problem is that now every syscall requires jumping between the kernel and userspace multiple times (up to 5 times, depending on the technique). In addition, you're at the mercy of the kernel scheduler for how much time passes between a syscall being initiated and the ptrace()-ing process getting scheduled.

LD_PRELOAD is much faster, but only works for programs which don't invoke syscalls directly.

What they're doing here is binary rewriting, via QEMU, for syscall interposition. Using QEMU for this will probably cause performance degradation (as opposed to performing a more surgical binary change). Basically the way this works is that a program, either at runtime or perhaps as a preprocessing step, disassembles the target program, and then replaces all the syscall instructions with calls to a function to replace the syscall. Easy!

Except... it's not easy at all! First, there's no guaranteed space on the stack for your custom syscall handler, so you need to do stack swapping. But be careful to properly handle the stack in the presence of signals! Next, (focusing on x86-64) the syscall instruction is encoded in only two bytes. As a result, you don't have much room to work with to insert a call to your syscall stub. You can try to move instructions around, but reliably disassembling binaries is hard. e9patch is one approach which tries to work around this by finding assembly fragments surrounding the syscall instruction which can be modified to allow for the insertion of a longer jump instruction.
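To make the size and disassembly problems concrete, here is a minimal Python sketch (not taken from e9patch or any other tool). It scans a hand-assembled x86-64 fragment for the byte pair `0f 05` and reports which following bytes a 5-byte `jmp rel32` would overwrite. The helper names and the sample bytes are made up for illustration.

```python
SYSCALL = bytes([0x0F, 0x05])   # x86-64 `syscall` opcode: two bytes
JMP_REL32_LEN = 5               # `jmp rel32` is 0xE9 plus a 4-byte offset


def find_syscall_candidates(code):
    """Return every offset where the byte pair 0f 05 occurs (naive scan)."""
    offsets = []
    pos = code.find(SYSCALL)
    while pos != -1:
        offsets.append(pos)
        pos = code.find(SYSCALL, pos + 1)
    return offsets


def bytes_clobbered_by_jump(offset):
    """Offsets after the syscall that a 5-byte jump written at `offset` overwrites."""
    return range(offset + len(SYSCALL), offset + JMP_REL32_LEN)


# Hand-assembled fragment (illustrative only):
#   b8 0f 05 00 00    mov eax, 0x50f   <- the immediate happens to contain 0f 05
#   0f 05             syscall
#   c3                ret
code = bytes.fromhex("b80f0500000f05c3")

for off in find_syscall_candidates(code):
    print(off, list(bytes_clobbered_by_jump(off)))
```

The scan reports two candidates, but only the one at offset 5 is a real instruction; the one at offset 1 sits inside the `mov` immediate, which is why a plain byte search is not a substitute for disassembly. Patching the real one with a 5-byte jump would also overwrite the `ret` and whatever follows it.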

I think lazypoline and zpoline are the current best approaches for high-performance syscall interposition.
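As I understand it, zpoline sidesteps the two-byte size limit by replacing `syscall` (`0f 05`) with `call *%rax` (`ff d0`), which is also two bytes. Since rax holds the syscall number at that point, the call lands at a low address where a trampoline has been mapped. The Python sketch below only shows the byte replacement; `patch_syscalls` is a hypothetical helper, and the offsets are assumed to come from a disassembly you trust.

```python
SYSCALL = bytes.fromhex("0f05")    # syscall
CALL_RAX = bytes.fromhex("ffd0")   # call *%rax, same length as syscall


def patch_syscalls(code, offsets):
    """Replace the syscall instruction at each given offset with call *%rax."""
    patched = bytearray(code)
    for off in offsets:
        # Refuse to patch anything that is not actually a syscall opcode.
        if bytes(patched[off:off + 2]) != SYSCALL:
            raise ValueError(f"no syscall instruction at offset {off}")
        patched[off:off + 2] = CALL_RAX
    return bytes(patched)


# Hand-assembled fragment (illustrative only):
#   b8 27 00 00 00    mov eax, 39      (getpid on x86-64)
#   0f 05             syscall
#   c3                ret
code = bytes.fromhex("b8270000000f05c3")
patched = patch_syscalls(code, [5])
print(patched.hex())
```

Because the replacement is the same length, no neighbouring instruction has to move. The hard parts this sketch leaves out are finding the offsets reliably and setting up the trampoline.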
- 0x2ba22e11
  
  I wonder if it's possible to patch all the spots that are easy to patch, and then, for spots that happen to be hard to patch, use a seccomp-bpf filter to check the address of the syscall instruction and have it interrupt only syscall instructions that don't come from one of the easily patched spots.
  - lcapaldo
    
    This is similar to what lazypoline (linked in the parent) is doing, though it uses SUD (Syscall User Dispatch) instead of seccomp, for reasons they describe in the paper.
    
    0x2ba22e11
    
    Thanks!
  - donio
    
    Can this be done with strace --inject? I tried a bit and I am able to change the syscall results but I don't understand how the value I specified turns into the values returned:
    
    strace -e clock_gettime --inject clock_gettime:poke_exit=@arg2=192105DB busybox date
    clock_gettime(CLOCK_REALTIME, {tv_sec=3674546457, tv_nsec=306230445}) = 0 (INJECTED: args)
    Mon Jun 10 04:20:57 PDT 2086
    
    Where busybox is statically linked.
    
    Edit: Needed to swap endianness:
    
    strace -e clock_gettime --inject clock_gettime:poke_exit=@arg2=00B46D38 busybox date
    clock_gettime(CLOCK_REALTIME, {tv_sec=946713600, tv_nsec=553997170}) = 0 (INJECTED: args)
    Sat Jan 1 00:00:00 PST 2000
- yshui
  
  If you turn to simulating the CPU and/or dynamic translation, you can actually do a lot more than just replace syscalls. That's how Valgrind works; see also Pin.
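On the strace --inject example further up: the numbers in the output are consistent with the hex string being written to memory byte by byte, in the order given, and then read back as a little-endian integer. A small Python sketch of that conversion follows. The function names are made up, and it assumes a little-endian target and that only the low 4 bytes of tv_sec are overwritten (the rest already being zero).

```python
def poke_hex_to_int(hex_bytes):
    """Interpret a hex byte string as a little-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def int_to_poke_hex(value, width=4):
    """Produce the hex byte string that stores `value` in little-endian order."""
    return value.to_bytes(width, "little").hex().upper()


print(poke_hex_to_int("192105DB"))   # first attempt in the thread
print(poke_hex_to_int("00B46D38"))   # second attempt, after swapping byte order
print(int_to_poke_hex(946713600))    # hex to poke for a chosen tv_sec
```

The first two calls give the tv_sec values strace printed (3674546457 and 946713600), and the third gives back `00B46D38`.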