We should improve libzfs somewhat
14 points by tomhukins
14 points by tomhukins
FWIW, libzfs.so.1 (and its companion, libzpool.so.1) is definitely not the application interface. It’s just a private implementation detail of zpool(8) and zfs(8), and potentially some syseventd(8) and fmd(8) modules. It’s definitely not a Committed interface, and shouldn’t become one.
The libzfs_core.so.1 interface is, as noted in the article, intended to provide a Committed interface with API and ABI stability. It’s not clear why work on it essentially stalled, but I think the answer is probably that good stable library design is difficult, and in the case of ZFS generally the commands are actually enough.
The commands have stable, documented output. There’s an easily machine-parseable way to get details about datasets, snapshots, pools, etc (this was true long before the JSON work was done, too). I have worked on lots of programs that automate the management of ZFS, and I don’t think that running the tools (rather than using a library) is actually an impediment. In fact, it has some key advantages; e.g., you don’t need to give your management process the elevated privileges it would need to make ioctl(2) calls to the kernel to manipulate the file system, you can instead use a facility like rbac(7) profiles and pfexec(1) to provide relatively constrained access, such as via the ZFS File System Management or ZFS Storage Management profiles.
I think nvlists are actually a very reasonable way to express an interface that has such complicated requirements. They’re strongly typed, and they’re amenable to ergonomic interfaces in higher level languages like Rust (you could serialise and deserialise them with serde, for example) or Go or Python or whatever else. They also make it much easier to provide ABI stability: ensuring C structs don’t accidentally change shape over time is notoriously difficult, as the source doesn’t show you implicit things like padding, alignment, or the overall size of the struct, and you can never reorder the fields in the source even if it would help make it clearer. They’re also self-describing, which means you can easily dump them out in the debugger; e.g., MDB has an ::nvlist dcmd that can, from just the pointer, show you the contents of a given nvlist_t.
I was pretty uncomfortable with channel programs when they were integrated originally, and I’m not sure my feelings have improved over time. They allow a turing complete program to be injected into the kernel and run in a relatively high priority context, and there’s essentially (by construction) nothing you can do to make that both completely safe against, say, infinite loops, and also correct (e.g., by not selecting an arbitrary timeout after which you just abort the thing).
The article also seems to conflate naming the arguments in C header files with documentation, but the names are intentionally absent from the headers: they’re not syntactically necessary, and putting names in the header both makes the header bulkier than it needs to be, and also provides a source of “split brain” between the declaration in the header and the actual definition of the function somewhere else. Documentation for these sorts of libraries is provided in manual pages, rather than in the header.
As an aside, we do not capitalise “illumos” – and the reference to “old Solaris” made me think of the story about Cary Grant receiving a telegram (billed, ostensibly, by the word) from a reporter that inquired HOW OLD CARY GRANT?, to which he replied: OLD CARY GRANT FINE. HOW YOU?
We’ve been operating ZFS fileservers since circa 2008, and for almost all of that time we’ve had (and needed) a private program that dumped (most) nvlist information for a particular pool. We did this for multiple reasons: the full collection of information we needed wasn’t available from ‘zpool status’ before the JSON output was added, we experienced situations where ‘zpool status’ would stall when there were disk problems, and parsing our program’s output in higher level (Python) management code was much easier than pre-JSON output.
(Even today I don’t know if ‘zpool status -j’ is committed to providing all information about the pool and its vdevs, or merely the information it thinks we need, and I haven’t cross-compared what it reports against what our tool does. Anyway, the JSON output option is too new to be in even the Ubuntu 24.04 ‘zpool status’, so we’re stuck without it for probably three or so more years.)
I’m curious to know what you were retrieving via nvlist fishing that you didn’t have through other avenues!
I hear you on things hanging when there’s I/O trouble. ZFS has a lot of intricate locking structures and the I/O paths sometimes do too much work with too many locks held. We’ve had issues bringing patches back from OpenZFS into illumos which have exacerbated similar problems around paging out memory into a ZVOL swap device; apparently the Linux folks don’t really do that, so they aren’t concerned with allocating memory while holding all kinds of locks, which creates problems (deadlocks, usually) under memory pressure.
Failing disks are also just challenging in general; they often just hang, rather than explicitly failing I/Os, so you have to forcefully retire them – but knowing when to do that, without causing even more problems as a result, is sometimes more art than science and ultimately risky.
The 2014 OmniOS/illumos version is available at this link as zpstatus.c, along with some programs (at the time) that uses it. More context is in this blog entry on our spares system as of then. Things have evolved and changed somewhat since then, with the current zpstatus version reporting more information as it’s become available in nvlists and we’ve noticed and decided that we cared, but I don’t have that published anywhere at the moment. (Also, these days we no longer use iSCSI, which may lessen the chances of ‘zpool status’ stalling.)
To add a note: my now-vague memory is that ‘zpool status’ stalled when our zpstatus nvlist dumper did not, presumably because ‘zpool status’ (at the time, on Solaris 10 and then OmniOS) was attempting to helpfully determine additional information about eg disks that would be useful for humans but which we did not need and did not want to stall status reports for. Since we were using this as part of fault handling, it was absolutely critical that our tool for obtaining ZFS status information did not stall or stalled in as few situations as possible.
I like this post, but for someone who has such attention to detail as to be able to comment on C library design, I’m disappointed that the author consistently can’t spell the word “its”.
Author here. Sort of a weird criticism, but also you’re not wrong! For whatever reason I’ve just never been able to internalise the rules properly. It’s kinda hit and miss whether or not I get the right spelling for the job, and it’s definitely more miss than hit when I’m tired (which I am this month; lots going on). Good feedback though, it is pretty bad! I’ll take a sweep through it over the weekend. Cheers!
Just remember Strong Bad’s helpful song:
If you want it to be possessive, it's just 'I-T-S'
But if it's supposed to be a contraction, then it's 'I-T-apostrophe-S'
Scalawag
In case it helps, I read https://en.wikipedia.org/wiki/Eats,_Shoots_%26_Leaves a while ago and it had an explanation that stuck with me: “yours”, “his”, “hers”, “theirs”, and “ours” are all possessive, and all have no apostrophe. Likewise “its”. Thus, “it’s” is the contraction of “it is”.
Of course, the downside of internalizing this is spotting the mistake in everything you read!
Ahh, I hadn’t heard the comparison with “yours” etc. Maybe I can make that stick, thanks!
I think at least part of the problem is that I really struggled to learn punctuation rules as a kid, and so needed a few basic rules drilled into me, one of which was that you always add the apostrophe to a person’s name to indicate possession. I suspect some wetware rule of “add ’s to the object to indicate possession” or something like that is getting involved for “its” and messing me up. That stuff is hard to retrain.
Anyway, I’ve looked at every \bit.?s\b
and sounded out “it is” or “it has” in my head, and I think they’re all right now. Oof.