I built a 2x faster lexer, then discovered I/O was the real bottleneck
23 points by lonami
> Apple uses SQLite extensively for its applications, essentially simulating a filesystem within SQLite databases.
The first part is true, but not the second (AFAIK.) Apple's SQLite-based APIs — CoreData and the newer Swift one — are database APIs, with an ORM and queries; they don’t look anything like a file system. And Apple's filesystem APIs talk to the file system, not a database.
The HN thread this quote links to is about the Photos app, which uses a database to store metadata (and thumbnails?), but the photos are still stored as individual files. At least they were the last time I looked.
@snej's reply is very good, but some supplemental points:
- cpio, as used in procs to speed up rapid re-querying of Linux /proc, takes only ~115 lines of Nim to implement a reader for from scratch (see the sketch after this list). Either cpio or .tar could be the standard Unix format the author/TFA wants merely with a convention: the last file added is the index. More or less this recapitulates Zip. Bonus: you can put whatever you want in that (probably binary) index file, like extended attributes or whatnot. So you really don't have to be beholden to tools like the dar TFA mentions.

- There is also pixz, which does an indexed tar file, and that indexing aspect is probably not unique to it. Anyway, doing his own index file is probably not much work compared to what he's already done.

- There is also bgzip, which allows parallel decompression; that may be a way to stay backward compatible while still getting decompression performance within ~1.5..2x of zstd's (though nowhere near zstd's compression ratios).
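A rough C sketch of a newc-format cpio reader (my own illustration, not the Nim reader mentioned above); it just prints each member's name and size, which shows how little there is to the format:

    /* newc cpio: a fixed 110-byte ASCII header with hex fields, and the
       name and data each padded to 4-byte boundaries. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static unsigned long hexfield(const char *p) {
        char tmp[9];
        memcpy(tmp, p, 8);
        tmp[8] = '\0';
        return strtoul(tmp, NULL, 16);
    }

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s archive.cpio\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        char hdr[110];
        while (fread(hdr, 1, sizeof hdr, f) == sizeof hdr) {
            if (memcmp(hdr, "070701", 6) != 0 && memcmp(hdr, "070702", 6) != 0) {
                fprintf(stderr, "not a newc/crc cpio header\n");
                return 1;
            }
            unsigned long filesize = hexfield(hdr + 6 + 6 * 8);   /* c_filesize */
            unsigned long namesize = hexfield(hdr + 6 + 11 * 8);  /* c_namesize, incl. NUL */

            char *name = malloc(namesize);
            if (!name || fread(name, 1, namesize, f) != namesize) return 1;
            /* header + name is NUL-padded so their total size is a multiple of 4 */
            fseek(f, (long)((4 - (110 + namesize) % 4) % 4), SEEK_CUR);

            if (strcmp(name, "TRAILER!!!") == 0) { free(name); break; }
            printf("%-40s %lu bytes\n", name, filesize);
            free(name);

            /* file data is likewise padded to a multiple of 4 bytes */
            fseek(f, (long)(filesize + (4 - filesize % 4) % 4), SEEK_CUR);
        }
        fclose(f);
        return 0;
    }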
> Each syscall costs roughly 1-5 microseconds

People keep saying this. That's... really slow. Is it a macOS thing for syscalls to be that slow, or is the number just some received wisdom people repeat without measuring?
A simple C program can do 1 million read syscalls in 200ms on my desktop (Linux x86_64, AMD Ryzen 9 3900X). That's 200ns per syscall, or 5-25x faster than claimed.
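Something like the following sketch (not the exact program; it times 1-byte pread() calls against /dev/zero, and results vary a lot with CPU, kernel version, and speculative-execution mitigations):

    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[1];
        struct timespec t0, t1;
        const long iters = 1000000;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            /* each pread is a real kernel round trip; the data itself is trivial */
            if (pread(fd, buf, sizeof buf, 0) != 1) { perror("pread"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per syscall\n", ns / iters);
        close(fd);
        return 0;
    }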
This reminds me of an lld/MachO speed-up patch https://github.com/llvm/llvm-project/pull/147134 ("[lld][MachO] Multi-threaded preload of input files into memory"). lld/MachO does not have parallel input file scanning, so the I/O overhead is quite large; this approach helps.
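For illustration, a standalone sketch of the idea (not the lld patch itself): read every input file once on a small thread pool so page-cache misses overlap instead of serializing on the single parsing thread. The file names are placeholders.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4

    static const char *inputs[] = { "a.o", "b.o", "c.o", "d.o", "libfoo.a" };
    static const int ninputs = sizeof inputs / sizeof inputs[0];

    static void *preload(void *arg) {
        long tid = (long)arg;
        char buf[1 << 16];
        /* each thread takes every NTHREADS-th file and reads it end to end */
        for (int i = (int)tid; i < ninputs; i += NTHREADS) {
            int fd = open(inputs[i], O_RDONLY);
            if (fd < 0) continue;                     /* missing inputs: skip in this sketch */
            while (read(fd, buf, sizeof buf) > 0) {}  /* pull the file into the page cache */
            close(fd);
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, preload, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(th[t], NULL);
        /* single-threaded scanning/parsing would start here, now mostly cache-warm */
        return 0;
    }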
There is even a further optimization: use a wrapper process. The wrapper launches a worker process that does all the work. When the worker signals completion (via stdout or a pipe), the wrapper terminates immediately without waiting for its detached child. Any script waiting on the wrapper can now proceed, while the OS asynchronously reaps the worker's resources in the background. I had not considered this approach before, but it seems worth trying.
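A bare-bones sketch of that shape (my own illustration, not mold's or wild's actual code): the wrapper returns to the caller as soon as the worker says the useful output exists, and the worker's teardown happens off the critical path.

    #include <stdio.h>
    #include <unistd.h>

    static void do_heavy_work(void) {
        sleep(1);   /* stand-in for the real job, e.g. producing the linked output */
    }

    int main(void) {
        int pipefd[2];
        if (pipe(pipefd) < 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {                                /* worker */
            close(pipefd[0]);
            do_heavy_work();
            if (write(pipefd[1], "x", 1) != 1)         /* tell the wrapper we're done */
                return 1;
            /* slow cleanup (unmapping, freeing, destructors) still runs here,
               but nobody is waiting on it anymore */
            return 0;
        }

        close(pipefd[1]);                              /* wrapper */
        char c;
        (void)read(pipefd[0], &c, 1);                  /* block until the worker signals */
        _exit(0);   /* leave without wait(); init reaps the orphaned worker later */
    }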
The mold linker utilizes this behavior, though it provides the --no-fork flag to disable it. The wild linker follows suit. I think performing heavy-duty tasks in a child process makes it difficult for the linker's parent process to accurately track resource usage.
In contrast, lld takes a different, more hacky path: it calls _exit instead of exit unless the LLD_IN_TEST environment variable is set. https://github.com/llvm/llvm-project/blob/a72958a95dcb7d815c01e20cc819532151d1856d/lld/Common/ErrorHandler.cpp#L108
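For illustration only (not lld's code): exit() runs atexit handlers and other orderly teardown, while _exit() goes straight to the kernel and lets the OS reclaim memory and file descriptors wholesale. The LLD_IN_TEST check here just mirrors the switch described above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void slow_cleanup(void) {
        fprintf(stderr, "running registered cleanup...\n");  /* stand-in for expensive teardown */
    }

    int main(void) {
        atexit(slow_cleanup);
        printf("work done\n");
        fflush(stdout);               /* flush anything we still care about first */

        if (getenv("LLD_IN_TEST"))
            exit(0);                  /* orderly shutdown: runs slow_cleanup */
        _exit(0);                     /* fast path: skips handlers and destructors */
    }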
Perhaps lld should drop the hacks in favor of a wrapper process as well. As an aside, debugging the linker would then always require --no-fork.