Designing microkernel IPC
18 points by Shorden
This is an okay initial write-up, but it skims over a lot of important details in this design space:
You can build call-return semantics over one-shot synchronous calls, but handling failure semantics is not trivial. Spring put a lot of effort into the shuttle model to avoid this, so each door call registered a return point in the kernel. Consider the case where A calls B and B calls C. While C is processing the call, B crashes. What do A and C do? The Spring Doors paper spends a lot of time discussing this, and the solution you pick has a massive impact on the programming model. In the short code snippet in the article, consider what happens if the callee crashes in process_message. There’s no timeout provided in ipc_receive, so the caller now blocks forever (or is maybe notified that the mailbox is gone?). If you add a timeout, how do you differentiate callee-crashed and timed-out failures?
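To make the last question concrete, here is a minimal sketch of a receive call that reports the two failures separately. Everything in it (`Mailbox`, `RecvStatus`, `peer_crashed`) is a hypothetical name for illustration, not the article's API, and it assumes the kernel can post a death notice into the reply mailbox when the callee is torn down.

```python
import enum
import queue


class RecvStatus(enum.Enum):
    OK = "ok"
    TIMED_OUT = "timed_out"   # callee may still be alive, just slow
    PEER_DIED = "peer_died"   # callee is gone, no reply will ever arrive


class Mailbox:
    """Toy reply mailbox (hypothetical; models only the failure reporting)."""

    _PEER_DIED = object()  # sentinel standing in for a kernel death notice

    def __init__(self):
        self._q = queue.Queue()

    def reply(self, msg):
        """Called by the callee to deliver a reply."""
        self._q.put(msg)

    def peer_crashed(self):
        """Assumed hook: the kernel calls this during callee teardown."""
        self._q.put(self._PEER_DIED)

    def receive(self, timeout):
        """Wait up to `timeout` seconds and return (status, message)."""
        try:
            item = self._q.get(timeout=timeout)
        except queue.Empty:
            return RecvStatus.TIMED_OUT, None
        if item is self._PEER_DIED:
            return RecvStatus.PEER_DIED, None
        return RecvStatus.OK, item
```

The caller can then retry on `TIMED_OUT` but must give up or reconnect on `PEER_DIED`. This does not answer the harder A/B/C question of who unwinds the chain when the middle process dies.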
The key observation from the L4 folks on synchronous vs asynchronous was that it’s much easier to build asynchronous messaging on top of synchronous than vice versa (assuming you have shared memory). You can create an in-memory ring buffer and use synchronous messaging to notify the other side to wake up. Xen did exactly this: the lowest-level messaging is a simple interrupt-like model, and the layers on top build lock-free ring buffers and use the low-level part for wakeups. LionsOS does the same, providing generic rings for asynchronous messaging atop seL4’s synchronous message passing.
It looks as if the author missed that (I’d really recommend the early L4 papers, they discuss this in detail) and went with asynchronous.
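As a rough illustration of the ring-plus-wakeup layering, here is a toy single-producer/single-consumer channel. It is a sketch under loud assumptions: a Python list stands in for shared memory, `threading.Event` stands in for the kernel's synchronous notify primitive, and the names `Ring` and `Channel` are invented here. A real implementation needs memory barriers and fixed-size slots.

```python
import threading


class Ring:
    """SPSC ring; `head` is written only by the consumer, `tail` only by the producer."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # index of the next slot to read
        self.tail = 0  # index of the next slot to write

    def is_empty(self):
        return self.head == self.tail

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring full, caller decides whether to retry or drop
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1
        return True

    def pop(self):
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item


class Channel:
    """Asynchronous messaging built from a ring plus a wakeup primitive."""

    def __init__(self, capacity=8):
        self.ring = Ring(capacity)
        self.wakeup = threading.Event()  # stand-in for the low-level notify

    def send(self, item):
        ok = self.ring.push(item)
        if ok:
            self.wakeup.set()  # only the wakeup goes through the "kernel"
        return ok

    def drain(self, timeout=None):
        """Block until notified (or timeout), then pull everything queued."""
        if not self.wakeup.wait(timeout):
            return []
        # Clear before draining so a push that races with us sets it again.
        self.wakeup.clear()
        items = []
        while not self.ring.is_empty():
            items.append(self.ring.pop())
        return items
```

The payload never passes through the notify primitive, which is why the low-level mechanism can stay tiny.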
The other thing missing is a discussion of the interaction with the scheduler. Asynchronous IPC usually suffers from latency issues because you push a message into a queue and eventually the server is scheduled, runs, puts a response in the queue, then you wake up and get the response. The other key innovation from Spring Doors was to make the message send not be a scheduling event. When you sent a synchronous message via a door, the callee woke immediately. There was no scheduler interaction and the server then ran with the caller’s scheduler quantum and inherited their priority.
The last point also avoided a load of priority inversion issues, which are important to consider in any multi-server microkernel. If a high-priority process sends a message to a lower-priority process, how do you avoid a medium-priority process causing starvation?
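The starvation scenario can be shown with a toy model. This sketch covers only the priority-inheritance part, not Spring's hand-off of the caller's quantum, and all names (`Thread`, `sync_call`, `pick_next`) are made up for the example. Higher numbers mean higher priority.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    base_priority: int
    donated: dict = field(default_factory=dict)  # caller name -> lent priority
    blocked: bool = False

    @property
    def effective_priority(self):
        # A thread runs at the highest of its own and any donated priorities.
        return max([self.base_priority, *self.donated.values()])


def sync_call(caller, callee, inherit):
    """Caller blocks on the callee; optionally lends its priority."""
    caller.blocked = True
    if inherit:
        callee.donated[caller.name] = caller.effective_priority


def sync_return(caller, callee):
    """Callee replies; any donation is withdrawn and the caller resumes."""
    callee.donated.pop(caller.name, None)
    caller.blocked = False


def pick_next(threads):
    """Toy scheduler: highest effective priority among runnable threads."""
    runnable = [t for t in threads if not t.blocked]
    return max(runnable, key=lambda t: t.effective_priority)
```

With a client at priority 10, a server at 1, and an unrelated batch job at 5, a call without inheritance lets the batch job run ahead of the server and so delays the client indefinitely. With inheritance the server runs at 10 until it replies.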
The simplest notification is a boolean flag in a mailbox. Notifying can be done by setting the flag to true and thus it never blocks. You can't know how many times the sender notified you. All you know is the peer wants you to do something.
Missed opportunity to call it an interrupt.
This is what I call the notify & pull pattern. The server notifies the client, and the client pulls the data using message passing.
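For readers who want to see the two quoted ideas together, here is a minimal single-threaded sketch of a coalescing flag plus a pull. The class and method names are assumptions for illustration, and `handle_pull` stands in for whatever synchronous request/reply call the system provides.

```python
class Notification:
    """Boolean flag: signalling never blocks, and repeated signals coalesce."""

    def __init__(self):
        self.pending = False

    def signal(self):
        self.pending = True

    def poll(self):
        """Return whether a signal was pending, and clear it."""
        was_pending, self.pending = self.pending, False
        return was_pending


class Server:
    def __init__(self, client_flag):
        self.events = []
        self.client_flag = client_flag

    def produce(self, event):
        self.events.append(event)
        self.client_flag.signal()  # says "something happened", not how much

    def handle_pull(self):
        """Stand-in for a synchronous call that returns all queued events."""
        out, self.events = self.events, []
        return out


class Client:
    def __init__(self, flag, server):
        self.flag = flag
        self.server = server

    def run_once(self):
        if not self.flag.poll():
            return []
        # The flag carries no count, so pull everything that is available.
        return self.server.handle_pull()
```

Three `produce` calls followed by one `run_once` deliver all three events through a single notification, which is the same edge-coalescing behaviour an interrupt line gives you.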