SurrealDB is sacrificing data durability to make benchmarks look better
27 points by av
when I ask the OS to write my [data] to the file, the OS will … [write that data to the] file cache [and will] tell me everything is done before it actually writes (ed: flushes) [that] data to the underlying storage.
For sure! And that’s because flushing thru to underlying storage is usually several orders of magnitude slower than just writing to the FS cache alone.
this behaviour … causes us issues if we want to ensure that when we make a change to a file, the data gets to the disk and won’t disappear if we lose power, or even if the next flush errors!
The issue is that fsync doesn’t actually guarantee these things. Calling fsync guarantees that the FS asked the underlying storage to write whatever data it provided, and that the storage responded “OK” – but that doesn’t mean much! Consumer HDDs often just put fsync’d data into volatile write-back caches, which would be lost after a crash. NFS under various configs will respond OK to an fsync before the write hits any kind of disk. Many (most?) cloud storage filesystems define fsync at a hypervisor boundary, well before any physical disks get involved. And so on.
Durability is a spectrum: fsync is stronger than not-fsync, but it is in no way a guarantee of durability.
Author here,
For sure! And that’s because flushing thru to underlying storage is usually several orders of magnitude slower than just writing to the FS cache alone.
This is indeed the case for most applications, which is why the cache is there, as I said in the post. But devices have come a long way, and latency and throughput are now much closer. That said, your application needs to be very focused on actually exploiting that, and on optimising for it, but it is possible.
A great example: on my RAID of two Gen4 NVMe drives, writing through the OS file cache is only about 30 microseconds faster than issuing an O_DIRECT write with io_uring. So we're slowly bridging the gap :)
The issue is that fsync doesn’t actually guarantee these things. Calling fsync guarantees that the FS asked the underlying storage to write whatever data it provided, and that the storage responded “OK” – but that doesn’t mean much!
I think this is very dependent on the hardware (and also a little bit on the file system). I will say at a certain point, you’re putting your faith in the hardware to do the right thing.
As far as I am aware, on Linux it will issue a partial write cache flush on the NVMe device, flushing the volatile buffers that a lot of consumer NVMe devices use. On macOS, I believe this requires F_FULLFSYNC to exhibit the same flush behaviour.
If your NVMe says “yep data is safe” when it actually isn’t, well, not a lot you can really do there other than get better hardware.
I imagine the same behaviour is implemented for spinning disks and other SATA devices, but whether they actually honour it is hard to say.
I can’t comment on the NFS behaviour, as it isn’t something I’ve really worked with or looked at the internals of. But for cloud storage systems, I think this is no longer true; on AWS at least, it seems EBS ensures durability before acknowledging writes, without needing an explicit FUA:
All writes to EBS are durably recorded to nonvolatile storage before the write is acknowledged to the operating system running in the EC2 instance. fsync() does not necessarily force unit access, and explicit FUA / flush / barriers are not required for data durability (unlike some storage devices or cloud storage systems). Perhaps there was confusion about the question that was asked.
I think this is very dependent on the hardware (and also a little bit on the file system). I will say at a certain point, you’re putting your faith in the hardware to do the right thing.
Right, but this is my point! 😅 You’re always putting your faith in something to do the right thing, according to your definition of the right thing. Using a normal write means you’re putting your faith in the FS cache. Calling fsync after every write means you’re putting your faith in the underlying storage driver. Using O_DIRECT means you’re putting your faith in the specific device’s DMA queue or whatever. These are all points on the risk spectrum, each with different costs and benefits. Of course, none of them actually guarantees what you really need, which is data durability across system restarts/crashes. They can only reduce the risks related to those conditions in their own specific ways.
I will say at a certain point, you’re putting your faith in the hardware to do the right thing.
I learned a lot from TigerBeetle’s approach to fsync problems and durability; they specifically call out other problems with fsync from “Can Applications Recover from fsync Failures?”
They take advantage of the fact that they’re running a distributed system. Specifically TigerBeetle “manages its own page cache, writes data to disk with O_DIRECT and can work with a block device directly, no file system is necessary” and “uses Protocol Aware Recovery to remain available unless the data gets corrupted on every single replica”.