There Is Life Before Main in Rust

42 points by mmastrac


ssokolow

Go is a notable exception in that it avoids the C runtime on most platforms, but Apple requires a C runtime to access syscalls.

Apple uses libSystem.dylib as the ABI stability boundary for syscalls, and NT-lineage Windows likewise has ntdll.dll, not the syscalls themselves, as its ABI stability boundary. The BSDs use libc as that boundary. On OpenBSD, I believe Go sets some kind of "opt out of NX-bit enforcement"-esque metadata flag so the kernel won't kill it for attempting to syscall from a location outside of the read-only libc mapping the loader set up.

EDIT: To clarify, libSystem.dylib contains the functionality which would normally live in libc.so, plus other things, so in that respect it's the same BSD-verse "libc is the stability boundary" dance.

EDIT: As of Go 1.16, Go uses libc on OpenBSD to comply with OpenBSD's syscall policy.

Linux is the anomaly in having stable syscall numbers instead of a "piece of the kernel that gets loaded into process address space as a dynamic library and shares an unstable enum definition of syscalls with the kernel-mode code", because Linux and glibc aren't developed together in the same repo the way everyone else's kernel and libc are.

There’s an entire ecosystem of processing that happens before the function you declared as main starts up. C uses this to configure allocation, file access, thread-local storage and other C runtime services. Rust uses this time to configure parts of its own language and runtime. Specifically, Rust has infrastructure to handle panics and unwinding. Rust also needs to translate the C-style program arguments into its own std::env::args interface.

On Windows, the C runtime is also responsible for parsing the CP/M-style command string (which MS-DOS copied and Windows's subprocess-spawning APIs kept) into a POSIX-style argv array. That's why Python's subprocess module documentation has a section named "Converting an argument sequence to a string on Windows" about how it will convert your argv array to a string following the quoting rules baked into the MS C runtime, which the invoked subprocess's own parser can deviate from if it so chooses.
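To make that conversion concrete, here's a minimal sketch using subprocess.list2cmdline, the helper CPython's subprocess module uses for it. It isn't listed in the module's public API, so treat relying on it as an assumption, and the sample arguments are made up for illustration.

```python
import subprocess

# Hypothetical argv for illustration; any list of strings works.
argv = ["prog.exe", "hello world", 'say "hi"', ""]

# Join argv into a single command string following the MS C runtime
# quoting rules. This is pure string manipulation, so it runs on any OS.
cmdline = subprocess.list2cmdline(argv)
print(cmdline)
```

Arguments containing spaces (and the empty argument) come back wrapped in double quotes, with embedded double quotes backslash-escaped. A program that parses its command string differently from the MS C runtime may still split this in some other way.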

On Linux, this hook is usually named _start and the linker automatically adds whatever symbol has that name to the binary.

Not quite. If an ELF-format binary is an executable rather than just a library, the e_entry field in its header (offset 0x18) contains the address for the loader to jump to once the binary is set up in memory. _start is the GNU toolchain's convention (GNU ld's default entry symbol, which things like NASM copy, IIRC) for how you specify what e_entry should point to when you opt out of libc providing it for you.
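As a rough sketch of what "offset 0x18" means in practice, the following reads e_entry out of an ELF header. The function name elf_entry_point and the synthetic header are my own illustration, not anything from the article, and the entry address in it is made up.

```python
import struct

def elf_entry_point(header: bytes) -> int:
    """Return e_entry from the leading bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    # e_entry sits at offset 0x18 in both classes; only its width differs.
    fmt = endian + ("Q" if is_64bit else "I")
    return struct.unpack_from(fmt, header, 0x18)[0]

# Synthetic 64-bit little-endian header so the sketch runs anywhere.
# The entry address 0x401000 is a made-up value.
fake_elf = bytearray(64)
fake_elf[:6] = b"\x7fELF\x02\x01"
struct.pack_into("<Q", fake_elf, 0x18, 0x401000)
print(hex(elf_entry_point(bytes(fake_elf))))
```

On a Linux box, feeding it the first 64 bytes of a real binary should agree with the "Entry point address" line that readelf -h prints.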

A similar hook exists on Windows, and boots the executable in a function named _WinMainCRTStartup. At this point the C runtime has a chance to configure itself, and the way that all runtimes do this is via initialization functions.

Which the loader finds via AddressOfEntryPoint in the PE header, at offset 0x0028 from the start of the PE header, which comes after the MZ (DOS EXE) header and DOS Stub.
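The same kind of sketch works for PE. The function name pe_entry_point_rva and the synthetic image are hypothetical, built only so the example is self-contained. Note that the value stored there is an RVA (relative to the image base), not an absolute address.

```python
import struct

def pe_entry_point_rva(image: bytes) -> int:
    """Return AddressOfEntryPoint (an RVA) from a PE image's headers."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ (DOS EXE) header")
    # e_lfanew at 0x3C in the DOS header is the file offset of the PE header.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # 4-byte signature + 20-byte COFF header + 16 bytes into the
    # optional header = 0x28.
    (rva,) = struct.unpack_from("<I", image, pe_offset + 0x28)
    return rva

# Synthetic image: MZ header, PE header at the made-up offset 0x80,
# and a made-up entry point RVA of 0x1000.
fake_pe = bytearray(0x100)
fake_pe[:2] = b"MZ"
struct.pack_into("<I", fake_pe, 0x3C, 0x80)
fake_pe[0x80:0x84] = b"PE\0\0"
struct.pack_into("<I", fake_pe, 0x80 + 0x28, 0x1000)
print(hex(pe_entry_point_rva(bytes(fake_pe))))
```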

EDIT 1:

"Making the smallest Windows application" and then "Tiny PE" are a good way to learn more about the ins and outs of PE headers through the vehicle of their authors figuring out how they can make smaller executables. (Tiny PE violates the PE spec in accepted-by-Windows ways, such as overlapping stuff where it knows the OS won't read one of the things being overlapped and stuffing code into unused header fields... but if you go this far, the smallest file Windows will accept is dependent on which Windows version you run it on.)

See also "A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux".

EDIT 2: OK, done.