Apple File System Reference (2020)
21 points by snej
Apple File System is the default file format used on Apple platforms. Apple File System is the successor to HFS Plus, so some aspects of its design intentionally follow HFS Plus to enable data migration from HFS Plus to Apple File System. Other aspects of its design address limitations with HFS Plus and enable features like cloning files, snapshots, encryption, and sharing free space between volumes.
Most apps interact with the file system using high-level interfaces provided by Foundation, which means most developers donʼt need to read this document. This document is for developers of software that interacts with the file system directly, without using any frameworks or the operating system — for example, a disk recovery utility or an implementation of Apple File System on another platform. The on-disk data structures described in this document make up the file system; software that interacts with them defines corresponding in-memory data structures.
A filesystem is a specific type of database; or else filesystems and databases are different perspectives on a common concept we haven’t really clarified yet. This has fascinated me for a long time.
Reading through the APFS spec I see a lot of similarities to b-tree database engines I’ve read about, most strikingly LMDB with its use of copy-on-write b-trees, although I’m sure it’s also like engines I don’t know such as Postgres. (Apple’s previous filesystem, HFS, was also based on b-trees, which I believe was unusual when it was first designed c.1985. And APFS must also have been heavily inspired by ZFS, which Apple was planning to migrate to in the 00s.)
From this perspective it seems kind of perverse to me that so many databases are implemented as another b-tree layer stored within a plain old stream-of-bytes file in a filesystem! I know there are many practical reasons for this, but what if a filesystem could let userspace programs create entities that used its underlying b-trees to store key-value data? They’d be both files and databases. It seemed like BeFS was taking steps toward this, but I haven’t heard of any further advances in that direction.
(And then of course there’s the NewtonOS approach where there is no filesystem at all, just a database “soup”, but that was more of an oddball graph/object database.)
Several years after making the Newton object store, I found myself working on WinFS at Microsoft. The origin of WinFS was (as far as I could tell) Bill Gates having the same nagging thought as you: Why do we need both file systems and databases, don’t they do the same thing? The difference of course being that Bill was in a position to direct hundreds of developers to actually make a combination database and filesystem!
I doubt anyone will write the book of what happened to WinFS, but I’ll just say the clients of filesystems and databases have very different expectations about: throughput, latency, memory usage, caching, schema complexity, schema rigidity, namespacing, transactions…basically everything.
On the much lighter-weight end of the spectrum, most filesystems this century do have a little key-value store built into every file (“attributes”). The original design of NTFS is based on this — the “contents” of the file is just a potentially very large attribute. But I think they all eventually optimized for one big attribute and lots of small ones (citation needed).
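Concretely, that per-file key-value store is exposed to userspace as extended attributes. A minimal sketch using Python's stdlib bindings (Linux-only in the stdlib, and the filesystem has to support user xattrs; macOS has the same idea via setxattr(2) or the xattr tool):

```python
# Extended attributes: the little key-value store hanging off every file.
import os

path = "notes.txt"
open(path, "w").close()

os.setxattr(path, "user.title", b"APFS reading notes")   # "user." namespace for unprivileged xattrs
os.setxattr(path, "user.rating", b"5")

print(os.listxattr(path))                # ['user.title', 'user.rating']
print(os.getxattr(path, "user.title"))   # b'APFS reading notes'
```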
In the end a bifurcation happened based on actual use cases. Everybody’s client OS implemented a background indexer for queries on derived file attributes (high latency but doesn’t get in the way of throughput or participate in transactions, flexible schema). On the server, NTFS and SQL Server folks implemented transactional semantics in the filesystem with the idea the two could cooperate (keeping the high throughput system for data but tying it to data transactions) — not sure where that ended up. And of course for putting schematized data in a file, we all have SQLite.
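To make that last point concrete, a few lines of Python's stdlib sqlite3: schematized, transactional data living in one ordinary file on the filesystem.

```python
# Schematized data in a single file on the filesystem, via the stdlib driver.
import sqlite3

con = sqlite3.connect("library.db")          # one ordinary file on disk
con.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
with con:                                    # a real transaction: commits or rolls back atomically
    con.execute("INSERT INTO book (title, year) VALUES (?, ?)", ("Filesystem notes", 2020))
for row in con.execute("SELECT title, year FROM book ORDER BY year"):
    print(row)
con.close()
```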
The layering in ZFS provides a transactional object store at the lowest level. I’ve been curious almost since ZFS was released whether you could build a database using that as the storage back end.
Hmm… The low level storage in Newton was a transactional block store based on the behavior of NAND flash (you can set all bits in a page to 1 bits, then selectively write 0 bits, and sometimes the write fails…). Thanks to my long-suffering coworker Landon Dyer, all this nonsense was hidden by an API that was basically new, read, rewrite whole block, and delete, based on opaque block IDs.
Then the higher level put an indexing structure on top of that, which could have been a normal hierarchical directory/file structure, but instead I went rogue and made it a document database: blocks with serialized objects (similar to JSONB) and blocks making up B-tree indexes (based on key retrieval similar to JSONPath).
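For concreteness, here is a toy reconstruction of the block-store interface described above (new, read, rewrite the whole block, delete, all keyed by opaque block IDs). The names and the in-memory backing are mine, not the actual Newton code, which of course talked to flash.

```python
# Toy block store: whole-block operations only, addressed by opaque IDs.
import itertools

class BlockStore:
    def __init__(self):
        self._blocks: dict[int, bytes] = {}
        self._ids = itertools.count(1)

    def new(self, data: bytes) -> int:
        """Allocate a block and return its opaque ID."""
        block_id = next(self._ids)
        self._blocks[block_id] = data
        return block_id

    def read(self, block_id: int) -> bytes:
        return self._blocks[block_id]

    def rewrite(self, block_id: int, data: bytes) -> None:
        """Replace the whole block; no partial updates."""
        self._blocks[block_id] = data

    def delete(self, block_id: int) -> None:
        del self._blocks[block_id]

store = BlockStore()
bid = store.new(b"serialized object ...")
store.rewrite(bid, b"new contents of the whole block")
```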
ZFS dnodes (what files and directories are made of) look pretty similar to the low level. But I wonder if there are optimizations in there that make it so you can't "really" just make up arbitrary dnodes.
I doubt anyone will write the book of what happened to WinFS,
Probably true but definitely a shame. I would love to have books on all the big failed projects of the 90s/00s: Taligent, Netscape 5, WinFS, Midori…
the clients of filesystems and databases have very different expectations about…basically everything.
That was what I figured were the “many practical reasons” for the two staying separate. And yet, it’s kind of perverse that the filesystem's performance depends in part on its having almost no durability (let alone A, C or I) guarantees for file contents, which makes programs have to jump through hoops to update files safely, which leads them to adopt databases like SQLite as their file formats, thus outsourcing the hoop-jumping to the db engine.
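A sketch of the hoop-jumping in question on a POSIX system: replacing a file's contents with any durability at all means writing a sibling temp file, fsyncing it, renaming it over the original, and then fsyncing the directory.

```python
# The usual safe-replace dance on POSIX (Windows can't fsync a directory,
# so the last step doesn't apply there).
import os

def atomic_write(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # force the new contents to stable storage
    os.replace(tmp, path)         # atomic rename over the old file
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable
    finally:
        os.close(dir_fd)

atomic_write("settings.json", b'{"theme": "dark"}')
```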
Transactional NTFS is deprecated: https://learn.microsoft.com/en-us/windows/win32/fileio/deprecation-of-txf
While TxF is a powerful set of APIs, there has been extremely limited developer interest in this API platform since Windows Vista primarily due to its complexity and various nuances which developers need to consider as part of application development. As a result, Microsoft is considering deprecating TxF APIs in a future version of Windows to focus development and maintenance efforts on other features and APIs which have more value to a larger majority of customers
I think TxF originated as support for the SQL Server FILESTREAM type (basically a BLOB that is stored in the filesystem). Sounds like nobody else could figure out how to use it so it's essentially an "exposed internal API" at this point. (Similar to what happened with NT filesystem drivers.)
As I recall, it was the surviving bit of WinFS. I remember WinFS being quite exciting. It was one of the flagship features of Longhorn (the cancelled version of Windows between XP and Vista). Even as someone who had abandoned the Windows ecosystem, it was exciting.
WinFS promised to take the BeOS ideas of files supporting arbitrary structured metadata and provide transactional editing, CoW-based versioning at the filesystem level, and a bunch of more user-friendly abstractions than a hierarchical filesystem (which MS usability research had found most people didn’t use; they put all of their files in one folder and did versioning by renaming the file).
As with BFS and the more interesting features of HFS+ (metadata, forks) it was mostly killed by the rise of networking (though, apparently, it was also delayed as a result of being dog slow and partly killed as a result of the delays). An exciting new set of filesystem abstractions is great, except that they also have to work on SMB shares, NFS shares, and so on. NeXT learned that lesson early. They moved from the HFS fork model to the bundle model still used in modern macOS so that they could have local UFS on hard disks or optical disks and remote NFS with no loss of functionality.
Rich filesystems end up being a usability cliff. As soon as you need to store a file in a FAT-formatted external disk or a network share, or send it as an attachment, you get an abrupt reduction in functionality. OS X does some things to put metadata in hidden files but this then causes problems when some other OS copies the file but not the metadata file.
We hoped to address this in Étoilé by decomposing the use for a filesystem so that collaboration, versioning, and persistence were handled by our storage layer (which we aimed to eventually sink into the OS) but sharing and interoperability were done as an ‘export’ operation. Modern mobile operating systems are slowly converging on similar abstractions, so there’s some hope for interesting filesystem models in the future.
Indeed, I always blamed email for blocking all further development of rich filesystems by introducing the expectation that every “item” you might want to attach has to be a single binary stream. The earliest example being that Mac apps used resource forks and type/creator tags until email and internet file exchange made them untenable (despite MacBinary).
There were holdouts, like until surprisingly recently you couldn’t email a Pages document using anything but Mail.app, because they were bundles (i.e., directories) rather than files. But even the Apple apps eventually gave up.
If there were an expectation that an import/export should happen at the exchange boundary (e.g., when you drag an item into a mail message) we’d have a hook to use, but that never happened. So the only usable form of exchange is a binary stream with a name that has vague type metadata appended to it (the extension), and the filesystem has to store literally that; any other data/metadata must be considered ephemeral.
Fortunately bundles can be easily transformed into a single stream and back.
They can be, but they aren’t. Drag a bundle into your Gmail message and see how well it works! And I didn’t mention the other killer, cross-platform (or even same-platform-but-cross-filesystem) compatibility. Due to ecosystem pressure, the vast majority of apps and sharing mechanisms (email, Dropbox, USB sticks) converge on using no more than FAT32 semantics for the units of sharing.
At the time (c.1999) I was arguing for never transforming them back to directories. That is, let a bundle be a file in some archive format like Zip or DMG and add abstraction at the app API level, like NSBundle, to make it appear to be a filesystem.
There was one experiment to try this out, but they were literally mounting the DMG as a filesystem at app launch, which naturally slowed down launch time a lot, so it was vetoed. A shame.
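A rough sketch of the archive-as-bundle idea from a couple of comments up, using Python's stdlib: keep the bundle as one zip file on disk and give code an in-process path-like view of its contents, rather than mounting it or expanding it into a directory tree.

```python
# A bundle kept as a single stream-of-bytes file, read through a
# filesystem-ish API (roughly the NSBundle-over-an-archive idea).
import zipfile

# Build a toy "bundle" as one zip file.
with zipfile.ZipFile("Example.bundle", "w") as zf:
    zf.writestr("Info.plist", "<plist/>")
    zf.writestr("Resources/strings.txt", "Hello")

# Read it without ever expanding it into real directories.
bundle = zipfile.Path("Example.bundle")
print([p.name for p in bundle.iterdir()])
print((bundle / "Resources" / "strings.txt").read_text())
```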
Hey now that sounds pretty awesome though. One of the worst parts of databases IMO is that you can't keep all your data managed by the database, you have to store paths to files in the database and manage those files externally if you wanna store anything remotely BLOB-like. I'd love to let the database take care of those files so that I don't have to think about where in the filesystem they should go and how to keep the filesystem in sync with the database.
Though... you could probably just build similar RDBMS features on top of normal UNIX-style filesystem APIs.
The designers of NTFS and BFS came to the same conclusion but approached it from different ends: the core functionality of a filesystem is a key-value store where some values are big and some are small.
BFS had an extensible inode format that stored small values in the inode. A tiny file on BFS uses one sector: the body of the file is stored inline in the inode. NTFS instead designated a region of the disk for storing small objects and provided an efficient way of writing small things to that space.
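Purely schematic (this is not BFS's or NTFS's actual on-disk layout), the "small values live inline, big ones get extents" decision looks something like this; the constants are made up for illustration.

```python
# Schematic only: shows the inline-vs-extents decision, not a real format.
BLOCK_SIZE = 4096
INLINE_CAPACITY = 224   # hypothetical spare bytes inside a one-sector inode

def allocate_extents(size: int) -> list[tuple[int, int]]:
    """Stand-in allocator: one contiguous (start_block, block_count) extent."""
    nblocks = -(-size // BLOCK_SIZE)     # ceiling division
    return [(1000, nblocks)]             # made-up starting block

def store_file(data: bytes) -> dict:
    if len(data) <= INLINE_CAPACITY:
        # Tiny file: the body rides along inside the inode itself,
        # so the whole file costs a single sector.
        return {"kind": "inline", "data": data}
    # Larger file: the inode only holds pointers out to data blocks.
    return {"kind": "extents", "size": len(data), "extents": allocate_extents(len(data))}

print(store_file(b"tiny")["kind"])           # inline
print(store_file(b"x" * 100_000)["kind"])    # extents
```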
It also seems interesting to me that most LSM-backed data stores like LevelDB were built on top of a file system back when spinning disks were the main form of persistent storage. From my understanding, one of the benefits of LSM trees is their ability to take advantage of sequential IO (something that would be great for spinning disks). However, I haven’t seen any work talk about the interaction with the file system: since files don’t have to be written contiguously, your LSM tree might be doing more random IO than sequential. Though maybe this is not exactly a problem in practice.
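For illustration, the sequential part of an LSM flush happens at the file-API level: the memtable goes out as one append-only sorted run, and whether that run is actually contiguous on disk is up to the filesystem's allocator, which is the concern above.

```python
# Minimal sketch of an LSM-style memtable flush: one new file, written front
# to back in sorted key order (an "SSTable"). Sequential IO from the program's
# point of view; the filesystem may still scatter the file's extents.
import json

def flush_memtable(memtable: dict[str, str], path: str) -> None:
    with open(path, "w") as f:
        for key in sorted(memtable):                     # sorted run
            f.write(json.dumps({"k": key, "v": memtable[key]}) + "\n")

flush_memtable({"banana": "2", "apple": "1", "cherry": "3"}, "sstable-000001.jsonl")
```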
I think the framing of "a common concept we haven't really clarified yet" is exactly right.
Very commonly, this gets mistaken for either (a) these two things are in fact the same thing or (b) one is an instance of the other (it can go either way). But really they are probably both more specific instances of a shared superclass/superconcept/interface.
To me it looks like the WinFS story (possibly BeFS as well) largely consists of falling into that trap.
To me there are two parts of this:
Interface polymorphism. These can probably be largely handled by a common interface. The Objective-S storage combinator interface (talk, pdf) does this.
Implementation polymorphism and sharing. It certainly seems like a lot of the waste, both from reimplementation and from layering, could be avoided if both the filesystem and database(s) were constructed in such a fashion that they could share a common superclass and common components.
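Not the Objective-S storage-combinator API itself, just a toy illustration of the common-interface point: the same get/put protocol fronting a directory of files and a SQLite table.

```python
# Two backends, one interface: a directory of files and a SQLite table.
import os, sqlite3
from typing import Protocol

class Store(Protocol):
    def put(self, key: str, value: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class DirStore:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)
    def put(self, key: str, value: bytes) -> None:
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(value)
    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

class SQLiteStore:
    def __init__(self, path: str):
        self.con = sqlite3.connect(path)
        self.con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
    def put(self, key: str, value: bytes) -> None:
        with self.con:
            self.con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
    def get(self, key: str) -> bytes:
        return self.con.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()[0]

for store in (DirStore("objects"), SQLiteStore("objects.db")):
    store.put("greeting", b"hello")
    print(store.get("greeting"))
```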
I’m glad this is (finally?) documented, but I’m still sad they didn’t adopt ZFS instead. That said, they did an amazing job on the live transition to Apple FS, and I wonder if that wasn’t part of the reasoning for not adopting ZFS: a seamless conversion from HFS+ to ZFS might have been too much of a headache.
My outsider understanding of ZFS is that it is "licensing nuanced," since [its license](https://github.com/openzfs/zfs/blob/zfs-2.4.0/LICENSE) (CDDL-1.0) is deemed OSI-compliant but is also adjacent to Oracle and apparently has some goofy interaction with the GPL (which is why it doesn't ship with Linux). I doubt Apple cares about the GPL (and may even consider that a plus in their world), but the adjacent-to-Oracle part is probably clearly in the camp of "not worth the hassle."
My experience with companies is also that they greatly prefer to control their own destiny, and thus trying to implement "someone else's filesystem" goes against NIH tendencies, so presumably that's strike two against any standardized FS.
I heard some rumours that Oracle tried to extort Apple for license fees, in spite of the open-source license (note: the problem with the license is strictly GPL incompatibility, it has no issues with any F/OSS license that doesn’t have GPL-like terms that cover the entire work). Apple’s reaction was to walk away.
That said, much as I like ZFS, a lot of the benefits show up only when you either have multiple disks and / or multiple computers. Apple’s primary requirement was for single-SSD devices (iPhone up to MacBook Pro).
I'm so happy they didn't go with ZFS. The less Oracle software we as a species depend on, the better.
But it's not Oracle software? Oracle happens to own a fork of ZFS, but nobody uses it, including Oracle (except for one product that's mostly dead). Oracle didn't invent or come up with ZFS, and it was open-sourced before Oracle got a closed-source, mostly irrelevant fork.
I guess technically we should bulk replace every instance of ZFS on the internet with OpenZFS to be more correct, but 99.99% of the time when anyone is talking about ZFS, they are talking about OpenZFS, which has nothing to do with Oracle.
OpenZFS: https://openzfs.org
As far as I'm concerned, ZFS is an Oracle product, OpenZFS and ZoL are utterly tainted by Sun and Oracle.
The only thing that could convince me otherwise would be if OpenZFS and ZoL was to re-licence under a license that's not intentionally GPL-incompatible. But that's never going to happen, because Oracle would never permit it.
Oracle is far from the only copyright holder of code in OpenZFS. And the folks who made large contributions to the codebase to improve ZFS on FreeBSD and Illumos have zero incentive to want to relicense their code for the benefit of the one kernel that chose a license that prevents code moving from it to other places.
Regardless, even if everyone else who holds copyright in OpenZFS wanted to relicense, they could not without Oracle's permission. That makes it sufficiently owned by Oracle for me to stay as far away as possible.
The same is literally true of the Linux kernel, which includes code owned by Oracle. If everyone not Oracle wanted to change the license of the Linux kernel and agreed on the license that they wanted to change it to, they could not do so without Oracle’s consent unless they rewrote all of the Oracle-owned bits. Similarly, Oracle cannot unilaterally change the license of the Linux kernel or of OpenZFS because a load of the copyright is owned by other people.
Contrast this with MySQL, where all of the copyright is owned by Oracle and they do routinely provide it under alternative licenses.
I'm certain some non-technical reasons were involved, but I was ignoring them to stay within a technical perspective, since this is Lobste.rs.
I wish they'd gone with something different, because APFS is now the only way (that I've found / read about) to create encrypted volumes, and it also has absolute dog's-arse performance on spinning metal (because it was built and optimised for SSDs). Which means all my external disks are now extremely annoying to deal with.