Things that go wrong with disk IO
27 points by eatonphil
This gets even weirder in hyperscaler clouds, where it's not obvious what fsync even means. We were on a call with firmware engineers at GCP at some point where they said, essentially: when you fsync, that has nothing to do with whether anything was written to disk. It means the battery-backed bit of hardware has your write in RAM and will start working on getting it over the network (and if you read it back to confirm it was written, we will serve it from that RAM). But if the hardware dies and we lose your data, we will make the whole disk unmountable. So you will never lose a single block after a successful fsync; you will lose the whole disk. That way you never have the problem of recovering from a crash onto a valid but outdated disk state: you will know the data was lost.
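For context, the application-level pattern being described is roughly the one below (a minimal Rust sketch; the path and payload are made up, and the comments describe the GCP behavior above rather than anything the syscall itself guarantees):

    use std::fs::OpenOptions;
    use std::io::{Read, Seek, SeekFrom, Write};

    fn main() -> std::io::Result<()> {
        // Hypothetical path and payload, just to illustrate the pattern.
        let mut f = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open("/mnt/local-nvme/journal.bin")?;

        f.write_all(b"record-1")?;

        // fsync: on a cloud block device this only guarantees the write reached
        // whatever durability domain the provider defines, not a platter or
        // flash cell on a specific physical disk.
        f.sync_all()?;

        // Reading it back may be served from that same battery-backed RAM,
        // so a successful read-back is not evidence the data is on stable media.
        let mut buf = vec![0u8; 8];
        f.seek(SeekFrom::Start(0))?;
        f.read_exact(&mut buf)?;
        assert_eq!(&buf, b"record-1");
        Ok(())
    }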
We’ve spent a lot of time at Oxide on similar aspects of the design of crucible, our EBS-like block substrate. We don’t currently put battery-backed RAM in each individual hypervisor host machine like that: our cache is volatile, and we require a quorum of backend disks (all on different machines) to acknowledge the flush barrier as durable before we tell the guest it’s complete. Asynchronous writes (i.e., the ones you make between flushes) can be acknowledged to the guest immediately as they are first transmitted to the backend, because there are no durability guarantees until you subsequently do a flush.
This means everything can be interrupted (e.g., power loss) or any individual thing can break (e.g., a disk failure) without dropping any acknowledged writes, and without quite as many ways that a single piece of hardware could send the entire virtual disk up in smoke. But it does represent a performance challenge we’ve had to work on, since acknowledgement time for flushes has a pretty direct impact on guest-visible latency for things like fsync.
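A rough sketch of the flush path described here, assuming a fixed replica set and a simple ack-counting quorum (the names, the 3-way layout, and the 2-of-3 quorum are illustrative assumptions, not crucible's actual code):

    /// Result of asking one backend disk to persist everything up to a flush barrier.
    enum FlushAck {
        Durable,
        Failed,
    }

    /// Acknowledge the guest's flush only once a quorum of replicas reports the
    /// barrier as durable; async writes before this point were acked immediately.
    fn flush_barrier(replicas: &[Box<dyn Fn() -> FlushAck>], quorum: usize) -> bool {
        let mut durable = 0;
        for replica in replicas {
            if matches!(replica(), FlushAck::Durable) {
                durable += 1;
                if durable >= quorum {
                    // Safe to complete the guest's fsync: enough independent
                    // machines hold the data on stable storage.
                    return true;
                }
            }
        }
        false
    }

    fn main() {
        // Hypothetical 3-way replica set; two of three must confirm durability.
        let replicas: Vec<Box<dyn Fn() -> FlushAck>> = vec![
            Box::new(|| FlushAck::Durable),
            Box::new(|| FlushAck::Failed), // e.g. a backend disk that is offline
            Box::new(|| FlushAck::Durable),
        ];
        assert!(flush_barrier(&replicas, 2));
    }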
A discussion we often had was: we run 3- or 5-member database clusters for every production deployment, and then we also pay for EBS/GCP Persistent Disk to triply replicate everything, at an enormous latency cost vs. the “local disk” NVMe offerings.
I left the DBaaS space a few years ago, so I’m not sure how it panned out, but the plan was to move away from replicated disks and use the local disk options instead, since the clusters themselves already handle disk loss, and we run the clusters either way.
Yes, local disks without the various replication taxes are something we’ve discussed. Our biggest challenge with it is that if you only have one replica, then when the machine that houses the storage is offline (due to hardware or power issues, a host OS panic, or a reboot for a host OS update), the VM that uses that disk stops working. We can shuffle the VM itself around with live migration, but obviously the storage at rest is an issue.
If you go further than just a single network-accessed disk replica, and say the data must truly be colocated on the same machine that actually runs the VM, then you have another problem: now you can’t easily move the VM itself to another machine (without also migrating all the disk data), which means you can end up with bin-packing, resource-siloing, and starvation issues.
I definitely agree there are use cases, in particular the kind you mention where you’re doing your own fault tolerance inside the guest. But not every application that runs in a VM is tolerant of that sort of interruption, which is why we focused first on the durability, and then the performance, of replicated disks.
Will GCP offer the ability to recover the valid data, or are you totally 100% hosed?
This was specifically for their “local” NVMe offering, and at least the people I talked to were quite clear that it would be the end of whatever was on that disk. This was years ago though, so it may have changed.
Postgres, SQLite, MongoDB, and MySQL fsync data by default before considering a transaction successful. RocksDB does not.
To be fair, nobody runs RocksDB stock; you can enable fsync’d writes, which any sane database built on it will typically do unless it doesn’t guarantee persistence.
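For reference, the per-write sync knob looks roughly like this with the rust-rocksdb bindings (a minimal sketch; the path and key are placeholders):

    use rocksdb::{DB, WriteOptions};

    fn main() -> Result<(), rocksdb::Error> {
        // Placeholder path; any writable directory works.
        let db = DB::open_default("/tmp/example-rocksdb")?;

        // By default RocksDB appends to the WAL but does not fsync per write.
        // sync=true makes each write wait for the WAL to be synced to disk,
        // which is what a database promising durability per commit would want.
        let mut opts = WriteOptions::default();
        opts.set_sync(true);

        db.put_opt(b"txn-1", b"committed", &opts)?;
        Ok(())
    }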