Durability: Linux File APIs (evanjones.ca)
97 points by zdw on Oct 12, 2020 | 39 comments


POSIX's atomicity guarantees relate to a running system; durability is about systems that are shut down (possibly unexpectedly and in an uncontrolled manner). POSIX is carefully worded to not give any durability guarantees. Naturally, writes are generally not atomic w.r.t. unexpected system maintenance, and there are no ordering guarantees at all unless you give ordering hints (like fsync/fdatasync/sync_file_range) yourself, which the system (kernel, filesystem, hardware) may or may not honor.


> Durability is about systems that are shut down (possibly unexpectedly and in an uncontrolled manner). POSIX is carefully worded to not give any durability guarantees.

What part of the POSIX specs are you reading?

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. [...] The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk."

https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...


All the data may be on the disk, but (last I checked) POSIX didn't require that filesystems not get completely corrupted in the event of a system crash. Obviously it would be nice if filesystems were not corrupted by a system crash, but I'm pretty sure POSIX doesn't guarantee it. And just because the data is on the disk, that (unfortunately) doesn't mean that you'll necessarily be able to find it.

(shades of MS Fnd in a Lbry by Hal Draper here)


This is a bit like how a mail service might guarantee they'll deliver your mail to a safe mailbox at that address but not guarantee anything about who is at that address or whether they'll be reachable at any point in time. The only thing this establishes is that there are more issues to worry about than the one I was replying to.

P.S. You don't even need a crash to induce the condition you're describing. Just call unlink() from another thread and suddenly fsync() will no longer be able to (and shouldn't) guarantee reachability...


It's silly that this is being downvoted.

Of course POSIX doesn't "require that filesystems not get completely corrupted in the event of a system crash", in the same way that it doesn't require that your RAM isn't faulty or that the computer manages to turn on at all after the next reboot.

A software specification and its guarantees only make sense in the context of properly functioning hardware and surrounding software.

In this case, that translates to "hardware doesn't randomly lose data" and "file systems do not randomly corrupt themselves", because that is the fundamental job of those systems.

fsync() is about durability, telling the POSIX-implementing OS interface to instruct the underlying software (e.g. file system) and hardware to sync the described volatile data into persistent state. For a system running on a ramdisk, that naturally means a no-op; for a network mount it means a roundtrip; for a PC with a hard disk it means that the described data is now on disk, and that it is there, uncorrupted, after an ordinary power failure. If you have broken RAM, a disk that lies about flush commands, or synced only a part of the intended data (forgotten to fsync the dir), then the POSIX system has still done its job regarding durability, and fulfilled its guarantees.
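For what it's worth, here is a minimal sketch of the "sync the described data" step in C. The helper name `save_durably` is made up for illustration, and it only covers the file's own data: it does not fsync the containing directory, which is the "forgotten to fsync the dir" case mentioned above. A short write is simply treated as a failure here to keep the sketch small.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path`, then ask the OS to
 * push them to stable storage with fsync().
 * Returns 0 on success, -1 on failure. */
int save_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len) {
        /* Error or short write: report failure instead of retrying. */
        close(fd);
        return -1;
    }

    /* Only after fsync() returns 0 has the system accepted the request
     * to transfer the data to the storage device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Whether the bytes really survive a power cut still depends on the filesystem and the drive honoring the flush, as discussed elsewhere in this thread.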


> All the data may be on the disk, but (last I checked) POSIX didn't require that filesystems not get completely corrupted in the event of a system crash.

It does. It specifies that the filesystem should behave in a particular (non-corrupted) way. There is no exception saying this requirement is somehow voided if the system wasn't shut down cleanly.


How can software possibly be expected to ensure what a physical device returns after the system restarts?

I would guess that 'intended to' is there very specifically to ensure that this statement really means... nothing at all. It can fail but if it 'intended to' succeed that's enough.

If your system crashes due to a power surge that breaks your hard drive... guess what it isn't going to give you back the bytes you wanted.


> How can software possibly be expected to ensure what a physical device returns after the system restarts?

By giving instructions to the device that the device understands and provides guarantees for...


On hard drives, even SATA ones, this has been mostly sorted out for a few years now (Linux doesn't use FUA over SATA, leaves the write-back drive cache enabled, and just issues a full cache flush, which the vast majority of drives respect, more or less). SSDs, less so; fsync is a rather relative affair on some of them.

IMHO, from having spent a fair amount of time on this, durability is a "fluid" concept, especially on desktop hardware (Microsoft had the right idea when they built Win Update on TxF), so if the data is important enough to worry about it, you should architect the system in a way that the durability of a fildes or drive is irrelevant to the overall system.


Right, but that's only because drives are lying at that point (or implemented so incompetently they have no idea they're lying...). Linux is trying to force liars to do what it wants anyway, and hence as a practical matter, programs need to worry about this too. This doesn't mean the spec is somehow broken or that honest (and competent) implementers couldn't guarantee durability if they wanted to.


How can the device provide those guarantees? If the system crashes hard in a power surge and the head crashes there's not going to be any use in claiming 'but the API guaranteed it'. There's no true durability.


You make it a requirement for some certification that the device maker wants to get.

I once wrote the Novell NetWare drivers for a SCSI host adapter vendor, and part of Novell’s certification testing was to run I/O heavy tests on a system that would randomly cut power to the drive or the computer and make sure that this did not corrupt things beyond what NetWare could fully recover from.

Hardware vendors like host adapter makers and drive makers wanted Novell certification, so made sure that was possible.

They might also provide settings that sacrificed that for more performance, but they would make sure you could get decent performance using the full durability settings.


> How can the device provide those guarantees? If the system crashes hard in a power surge and the head crashes there's not going to be any use in claiming 'but the API guaranteed it'. There's no true durability.

By your logic there is no such thing as a guarantee in this world, period. A meteor could crash into Earth and wreck everything.


> Very astute observation.

You're being snarky. That's not allowed in the rules here and it's unkind even if it wasn't.

> By your logic there is no such thing as a guarantee in this world, period.

So what does the guarantee really mean?

That the data will be on disk... as long as the crash isn't too hard? What do you usefully do with this guarantee? How is it better than no guarantee?


Okay, I removed that phrase, but what do you want me to say here? Entire disciplines have devoted decades of research into providing useful guarantees and yet you're here observing that no human could possibly prevent some disasters... and using that logic to just trash everything and everyone's associated work along the way. I mean, you're clearly technically right that there is no such thing as an absolute guarantee, but... what am I supposed to do with this assertion? Find it insightful? Give up on the notion of a guarantee? Declare everything is useless? What response were you hoping for that you're still not getting?


What do you think the phrasing 'shall request ... is intended to' means there? It's a trapdoor for 'actually guarantees nothing', right?


It means: "In case we didn't get the point across in the technical language above, this is what we're trying to say in plain English."

It does not mean "we're just writing this superfluous text to clock in extra hours or pass a word count threshold".


> It does not mean "we're just writing this superfluous text to clock in extra hours or pass a word count threshold".

Where have I said I thought it was about writing time or a word count? You've imagined that.


Well I can't guess what else you imagined it to mean besides those, but I hope it's clarified now.


Not really - the language still seems super vague - intentionally so and thus useless. But thanks for trying, despite the (for some reason) bitter snark and sarcasm!


I'm still lost as to what kind of response you were hoping for that you didn't already receive, but you're welcome.


I take it to mean that absent some unexpected, external event (like a head crash, flood, meteor, EMP, etc...) data will be on disk.


The device would guarantee that the data is safe as long as it doesn't get physically damaged. That's an extremely useful guarantee. Power outages are common and almost never cause physical damage.

A flash device that isn't designed carefully enough could lose multiple megabytes of semi-random sectors if it gets interrupted at the wrong time, as a consequence of giant erase blocks.


The language you are quoting is quite clear that the behavior you are thinking about is implementation-defined, and the informative section clarifies that this is one-hundred percent intended.

> It is explicitly intended that a null implementation is permitted.


No, a null implementation is permitted because write() could potentially already provide these guarantees internally, at which point fsync() would be able to satisfy all requirements without needing to do anything. That sentence does not mean you could take any arbitrary system and slap a no-op fsync() onto it and somehow still maintain these guarantees.


The next sentence in the document would have told you that this conception is simply wrong. fsync does not give you durability guarantees. POSIX is worded such that a system with no durability can be compliant.

> It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure. The conformance document should identify at least that one configuration exists (and how to obtain that configuration) where this can be assured for at least some files that the user can select to use for critical data. It is not intended that an exhaustive list is required, but rather sufficient information is provided so that if critical data needs to be saved, the user can determine how the system is to be configured to allow the data to be written to non-volatile storage.

Again, the way the normative language you quoted above is written makes it abundantly clear (for a standard) that fsync doing anything is entirely optional.


Did you read the first sentence in that paragraph? That entire paragraph is talking about when _POSIX_SYNCHRONIZED_IO is not defined:

> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. [...]

If you check your Linux system, you will see _POSIX_SYNCHRONIZED_IO is defined:

  printf "%s\n" "#include <unistd.h>" "#ifdef _POSIX_SYNCHRONIZED_IO" "#error _POSIX_SYNCHRONIZED_IO is defined" "#endif" | gcc -fsyntax-only -x c -

and hence the rest of that paragraph isn't talking about this situation.
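The same check can also be sketched as a small C fragment, in case that is easier to drop into a test program. The function names are made up for illustration; `sysconf(_SC_SYNCHRONIZED_IO)` is the runtime counterpart of the compile-time macro, and it returns -1 when the option is unsupported or indeterminate.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: runtime value of the Synchronized I/O option. */
long synchronized_io_support(void)
{
    return sysconf(_SC_SYNCHRONIZED_IO);
}

/* Hypothetical helper: print both the compile-time and runtime view. */
void report_synchronized_io(void)
{
#ifdef _POSIX_SYNCHRONIZED_IO
    printf("compile time: _POSIX_SYNCHRONIZED_IO = %ld\n",
           (long)_POSIX_SYNCHRONIZED_IO);
#else
    printf("compile time: _POSIX_SYNCHRONIZED_IO is not defined\n");
#endif
    printf("run time: sysconf(_SC_SYNCHRONIZED_IO) = %ld\n",
           synchronized_io_support());
}
```

The exact values printed depend on the C library, so treat the output as something to inspect rather than a fixed result.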


Somewhat relevant is this article, https://danluu.com/file-consistency/ which was discussed here a while back.



Thanks for your pointer. Do you mean this thread? "Json-Base – Database built as JSON files | Hacker News" https://news.ycombinator.com/item?id=23715558


I was thinking about this dedicated post, https://news.ycombinator.com/item?id=21674610 but I think I saw it multiple times here over the years.

(I think there was at least one more article by danluu on this subject but this one I remembered by name, having discussed it several times with colleagues when considering using something (sqlite) less ad hoc for temporary data)


> Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. However, write is not required to be complete, and is allowed to only transfer part of the data.

This is technically the right interpretation of what POSIX says, but it totally ignores the fact that in such a state the application will know that this happened and should handle it. In the first place, the rest of the write request did not happen at all, and it will not happen unless the application retries the write(2) with the rest of the data. Most applications simply ignore this situation, which is the reason it is guaranteed not to happen on any even vaguely unix-derived OS (and this is exactly the number one reason for error messages of the "Error: Success" kind).
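To make "retries the write(2) with the rest of the data" concrete, here is the usual retry loop as a rough sketch. The name `write_all` is just a conventional label for this pattern, not something POSIX provides.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes have
 * been transferred. Returns 0 on success, -1 on error (errno is the one
 * set by the failing write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data moved: retry */
            return -1;
        }
        /* Partial transfer: advance past what was written and go again. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Note that once the loop has to go around more than once, the data arrives through several write() calls, so the request as a whole is no longer a single atomic write.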


Right, I think the point is that you no longer have atomic semantics for the write as a whole.


Well until people's file descriptors happen to be pipes or sockets.


and then there's mmap where you can observe changes at byte-level granularity.


> In particular, when you first create a file, you need to call fsync on the directory that contains it, otherwise it might not exist after a failure.

Suppose you have created a file in a directory that is not readable. How do you get a file descriptor for the directory to call fsync on? It's an error to try to open a directory for writing.


Hm, from what I can tell this doesn't seem to be possible.

There is one way to open a directory without read permissions, which is with O_PATH, but the resulting fd can only be used "to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level". And indeed if you try to use this fd to fsync() you get EBADF.
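A small sketch for anyone who wants to reproduce that on Linux. The function name is made up, and `O_PATH` is Linux-specific, hence the `_GNU_SOURCE` define at the top.

```c
#define _GNU_SOURCE   /* O_PATH is a Linux extension */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `dir` with O_PATH and try to fsync it.
 * Returns 0 if fsync() succeeded, otherwise the errno value from the
 * failing open() or fsync() call. */
int fsync_via_o_path(const char *dir)
{
    int fd = open(dir, O_PATH | O_DIRECTORY);
    if (fd < 0)
        return errno;

    int err = (fsync(fd) == 0) ? 0 : errno;
    close(fd);
    return err;
}
```

Comparing the return value against `EBADF` shows the behaviour described above.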

I guess this is just an edge case of an edge case that most people aren't concerned about. One way to work around this issue with a bare minimum of compromise to security would be a setuid program that does nothing but open its working directory then fsync it.


Programs that care about durability should use `openat(int dirfd, const char *pathname, ...)` to get an FD on the file after getting an FD on the parent dir anyway, instead of using `/path/to/dir/file` style paths, because otherwise there's always the chance that something in the hierarchy was moved away or mounted over.

In that case, you always have a handle on the directory.

Opening the dir for fsyncing separately after having written the file is similar to doing close()+re-open()+fsync() on a file, which does not provide as strong guarantees as direct open()+fsync(): https://stackoverflow.com/questions/37288453/calling-fsync2-...




