Indeed, sparse files are simply a mistake to have included in Unix in the first ...

retrac · on Dec 26, 2023

Sparse files make more sense if you see the file system and paging as unified. If you have allocated an array of 1 billion items, accessing the last item doesn't make the OS zero out everything from 0th to the billionth item, allocating millions of pages along the way. Virtual memory is sparse; so just one page of virtual memory is allocated. Mmap'd sparse files behave the same way.
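
To make the mmap comparison concrete, here is a minimal sketch, assuming a POSIX system and Python's standard `mmap` module. `touch_last_byte` is a hypothetical helper, not anything from the article: it extends a temporary file without writing data, maps it, and touches only the final byte. How much space ends up allocated depends on the filesystem, so the last returned value is only informative.

```python
import mmap
import os
import tempfile


def touch_last_byte(size):
    """Map a freshly extended file and write only its final byte."""
    with tempfile.TemporaryFile() as f:
        f.truncate(size)  # extends the file length without writing any data
        with mmap.mmap(f.fileno(), size) as view:
            view[size - 1] = 0xFF  # only the last page is dirtied
            first, last = view[0], view[size - 1]
        st = os.fstat(f.fileno())
        # st_blocks counts 512-byte units on Linux; on a filesystem with
        # sparse file support this is typically far smaller than st_size.
        return first, last, st.st_size, st.st_blocks * 512
```

Calling `touch_last_byte(64 * 1024 * 1024)` and comparing the last two values shows whether the filesystem in use left the untouched range unallocated.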

ajross · on Dec 26, 2023

No, I get it. I'm saying that's a bad design. The data structure for a VM system is a big tree of discontiguous mappings, which matches the API used for accessing it. If you make a random access to memory at an arbitrary spot, you expect to get a VM trap. If you want to map memory, you're expected to know the layout and manage the "holes" yourself (or else to let the OS manage your memory space for you).

The data structure for a file is an ordered stream of bytes, which matches the API for accessing it. You can jump around by seeking, but there are no holes. Bytes start at 0 and go on from there. Want to seek() to an arbitrary value? Totally legal, presumptively valid.

Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

avianlyric · on Dec 26, 2023

> Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

Aren’t all modern file systems implemented as a tree of discontiguous regions? That’s the whole reason block allocators exist, and why file fragmentation (and defragmentation) is a thing.

How could you reasonably expect to implement a filesystem that under the hood only operates with contiguous blocks of disk space? It would require the filesystem to have prior knowledge of the size of all the files that are going to be written, so it can pre-allocate the contiguous sections. Otherwise, the moment a write made a file exceed the length of its contiguous empty section of disk, future writes would have to pause until the filesystem had finished copying the entire file to a new region with more space.

With ZFS, its heavy dependence on tree structures of discontiguous address regions is what enables all of its desirable features. To say the complexity is needless is to implicitly say ZFS itself is pointless.

p_l · on Dec 26, 2023

The issue is that pretty much all other filesystems, at least on Linux, are effectively implemented as swap filesystem drivers with some hierarchical structure on top, because that's the interface pushed by Linux at kernel level.

In userland, we tend to think of streams of bytes, as provided by original Unix and as all the docs teach us to treat them - that read(), write() are the primitives and they do byte-aligned reads and writes.

Except the actual Linux VFS has, as its core primitive, mmap() + pagein/pageout mechanism, with read() and write() being simulated over the pagecache which treats the files as mmap()ed memory regions. It's how IO caching is done on Linux, and it's a source of various issues for ZFS and people using different architectures because for a long time (changed quite recently, afaik) Linux VFS only supported page-sized or smaller filesystem blocks. Which is a bit of a problem if you're a filesystem like ZFS where the file block can go from 512b to 4MB (or more) in the same dataset, or VMFS which uses 1MB blocks.

avianlyric · on Dec 26, 2023

What has any of that got to do with the bug described in the article? Presumably every filesystem is responsible for tracking the content of sparse files, and where holes are. That's not something the Linux kernel is going to give you for free; the FS needs to tell the kernel which pages should be mapped to block addresses on disk and which pages should be simulated as continuous blocks of zeros with no on-disk representation.

p_l · on Dec 27, 2023

It's related to the talk about filesystem interface metaphors in this specific subthread :)

ajross · on Dec 26, 2023

That's true of a storage backend, but not the metaphor presented. Again, the analogy would be a heap: heaps are discontiguous internally too, but you don't demand that users of malloc() understand that there can be a hole in the middle of their memory! Again, the bug here was (seems to have been, it's subtle) a glitch in the tracking of holes in files that didn't ever need to have been there in the first place.

avianlyric · on Dec 26, 2023

But ZFS doesn't demand that users be aware of holes in files. You can just call `seek()` and `read()` to anywhere, and ZFS will transparently provide zeros to fill the holes. Linux also allows software to become "hole-aware" using `lseek()`, but that's an optimisation that software can opt into, but can equally just ignore.

The glitch in this case was a failure to correctly track dirty pages that have yet to be written to disk, and thus reading the on-disk data, rather than the in-memory data within the dirty page. It just so happens this issue only appears in the code that's responsible for responding to queries about holes from software that's explicitly asking to know about the holes. ZFS itself never had any issues keeping track of the holes, the bookkeeping always converged on the correct state, it's just that during that convergence it was momentarily possible to be given old metadata about holes (i.e. what's currently on disk), rather than the current metadata about holes (i.e. what's currently only in-memory, and about to be written to disk).
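
For readers who haven't seen the opt-in interface, this is roughly what a "hole-aware" program does. It is a minimal sketch, assuming a platform where Python exposes `os.SEEK_DATA` and `os.SEEK_HOLE` (Linux does); `data_extents` is a hypothetical helper name. On a filesystem that doesn't track holes, the kernel is allowed to report the whole file as one data extent, so callers can't rely on holes being reported.

```python
import errno
import os


def data_extents(fd):
    """Return (start, end) byte ranges that the filesystem reports as data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            # Next offset at or after `offset` that contains data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # only a hole remains until EOF
                break
            raise
        # Next hole at or after `start`; end of file counts as a hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        offset = end
    return extents
```

A copy tool would then read and write only the returned ranges and skip everything else, which is why a stale answer to these queries can turn into missing data in the copy.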

benlivengood · on Dec 26, 2023

There are pretty good reasons for treating files as sparse; virtualization and deduplication. Virtualization of storage devices without sparse files would be slowed tremendously by the need to allocate and zero large regions before use, essentially double-writing during the installation and initial provisioning stage. You can force the virtualization layer to implement sparse storage but then you get a host of incompatible disk image formats (vmdk, qcow2, etc.) and N times as many opportunities for bugs like the article describes to be introduced.

Deduplication is basically a superset of sparse files where the zero block is a single instance of duplication. Deduplication isn't right for every task but for basically any public shared storage some form of deduplication is vital to avoid wasting duplicate copies.

Sparse/deduplicated files still maintain the read/write semantics of files as streams of bytes; they allow additional operations not part of the original Unix model. Exposing them to userspace probably isn't a mistake per se because it is essentially no different than ioctls or socket-related functions that are a vital part of Unix at this point.

ajross · on Dec 27, 2023

Those aren't general principles. They're just tricks. Some software uses them. No significant software paradigms are critically dependent on sparse files. Quite frankly almost no significant market-driving software uses them at all. Not sure what you have in mind, but a few examples might be helpful?

All of them have a straightforward expression using contiguous storage. At best, sparse files allow you to reduce application-layer complexity. But as I'm pointing out, that comes at the cost of filesystem-layer complexity up and down the stack and throughout the kernel, and that's a bad trade.

cogman10 · on Dec 26, 2023

This is a feature I was completely unaware of. Why would you choose to use a sparse file instead of multiple files?

wolf550e · on Dec 26, 2023

Imagine a torrent client (or http client downloading a file in parallel using HTTP range requests). It creates an empty file, and then it has downloaded 1MB of data to write at offset 100GB and wants to write it to disk. It does not want to pay the price of waiting for 100GB of zeroes to be written. The other blocks will all be downloaded and written eventually, all out of order. If the filesystem had an atomic operation to transform a bunch of (block aligned) files into a single file (like AWS S3 Multipart Upload), then sparse files would not be needed for this case.
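
The access pattern looks something like the following. This is a minimal sketch, assuming a POSIX system with `os.pwrite`; `PIECE_SIZE`, `write_piece`, and `assemble_out_of_order` are hypothetical names for illustration and not how any particular client is written.

```python
import os
import tempfile

PIECE_SIZE = 256 * 1024  # hypothetical piece size


def write_piece(fd, index, data, piece_size=PIECE_SIZE):
    """Write one piece at its final offset; earlier offsets may still be unwritten."""
    os.pwrite(fd, data, index * piece_size)


def assemble_out_of_order(pieces, order, piece_size=PIECE_SIZE):
    """Write pieces in arrival order, then read the finished file back."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        for index in order:
            write_piece(fd, index, pieces[index], piece_size)
        return os.pread(fd, piece_size * len(pieces), 0)
```

Writing piece 3 before piece 0 simply extends the file past a range nothing has written yet; whether that range occupies disk space in the meantime is up to the filesystem.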

loeg · on Dec 26, 2023

Fallocate is a much better interface for this than sparse files. The torrent client does not care how the underlying filesystem provides the ability to randomly write a large file. And fallocate is a much clearer signal to the filesystem than a sparse file.
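
As a rough illustration of the preallocation route, here is a minimal sketch, assuming a POSIX system where Python exposes `os.posix_fallocate`. `preallocate` and `preallocated_size` are hypothetical helpers. How the request is satisfied, and whether it does anything for fragmentation, varies by filesystem.

```python
import os
import tempfile


def preallocate(fd, size):
    """Ask the filesystem to reserve `size` bytes for the file up front."""
    os.posix_fallocate(fd, 0, size)


def preallocated_size(size):
    """Preallocate a temporary file and report (logical size, allocated bytes)."""
    with tempfile.TemporaryFile() as f:
        preallocate(f.fileno(), size)
        st = os.fstat(f.fileno())
        # st_blocks counts 512-byte units on Linux.
        return st.st_size, st.st_blocks * 512
```

After this call the application can write chunks at arbitrary offsets within the reserved range, the same way it would with plain seeks and writes.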

avianlyric · on Dec 26, 2023

fallocate is just an interface to create sparse files. The result of using `fallocate` is a sparse file.

loeg · on Dec 26, 2023

You should read my comment in the context of the one it is replying to. That comment suggested a torrent client using seeks + writes to randomly insert chunks as they were downloaded. I have summarized this approach in my comment as "sparse files," expecting charitable readers to be familiar with the context. This method of creating sparse files does not tell the filesystem anything about the intent of the application and usually creates a bunch of fragmentation under torrent-like workloads.

avianlyric · on Dec 27, 2023

“Sparse files” is a specific term[1] referring to files where the filesystem tracks and doesn’t allocate space for unwritten file content (i.e. content that would just be zeros if read) in large preallocated files.

To use the term “sparse file” to also refer to files with large continuous runs of zeros, created via a seek operation, is just confusing. Those are quite explicitly not sparse files, they’re just files that happen to be full of zeros (all written to disk). “Sparse files” are quite explicitly the result of the optimisation to avoid writing pointless zeros when preallocating a large file that’s going to be written into in an unordered manner.

Using the term “sparse files” to refer to both the “problem” and the “solution” is just unhelpful, and doesn’t align with the accepted meaning of the term.

[1] https://en.m.wikipedia.org/wiki/Sparse_file

epcoa · on Dec 26, 2023

It’s not about being charitable. For those unfamiliar with the terminology this is just confusing, and for those that are familiar this discussion is all fundamental and well known anyway.

Unfortunately for COW filesystems including zfs and btrfs fallocate doesn’t do anything useful for preallocation. You’re still going to get fragmentation. The two methods outlined are essentially equivalent.

loeg · on Dec 27, 2023

> For those unfamiliar with the terminology this is just confusing, and for those that are familiar this discussion is all fundamental and well known anyway.

Eh, agree to disagree.

> Unfortunately for COW filesystems including zfs and btrfs fallocate doesn’t do anything useful for preallocation.

Both ZFS and BtrFS have "nocow" modes that are probably more suitable to this type of use case. And other filesystems are widely used.

finnh · on Dec 27, 2023

Can you point me to docs for ZFS offering a nocow mode? I haven't used it in about a decade, but I can't see how that would work - wafl/cow is a pretty fundamental invariant in everything ZFS does.

myself248 · on Dec 26, 2023

Out of curiosity, do you know off-hand how torrent clients do it on filesystems that don't support sparse files? There must be either a preallocate-the-whole-thing step, or a gather-the-pieces-together-and-write-out-the-large-file step. The latter would seem to briefly double the disk space needed at the end of the download, so I suspect they do the former.

ajross · on Dec 26, 2023

Chunk it up and reassemble, one assumes. Things aren't nearly as clear in the modern world of gigabit pipes into suburban households[1], but when these things were written the filesystem was 100x faster than the link to those peer connections from which the data was fetched. A final copy was only a small overhead.

[1] Which is why all the stuff we used to torrent is in the cloud now.

cesarb · on Dec 26, 2023

AFAIK, they preallocate; and even on filesystems which support sparse files, most bittorrent clients have an option to always preallocate (to both reserve space and reduce fragmentation).

andoma · on Dec 26, 2023

Somewhat related is that a few filesystem types on Linux allow you to remove / insert bytes "within" a file, though the range needs to be aligned to the filesystem block size. This uses the same syscall, fallocate(2), which can also be used to punch new sparse holes in a file where it previously had data.

See https://man7.org/linux/man-pages/man2/fallocate.2.html

axus · on Dec 26, 2023

A memory map of a gigabyte/whatever, that uses a bunch of large addresses but typically only uses 1% of the available space. Saves someone the trouble of managing the map or compressing it.

It does feel like a weird decision from long time ago that we're stuck with. I thought it was some quirky Linux feature but it's been around https://en.wikipedia.org/wiki/Sparse_file

vlovich123 · on Dec 26, 2023

The number of file descriptors you can have open by a single program is limited and eats up kernel resources.

ajross · on Dec 26, 2023

Even for a torrent client, the number of active file descriptors is a function of the number of peer connections (e.g. a few dozen). It doesn't scale with the size of the output file.

vlovich123 · on Dec 26, 2023

Think databases like RocksDB not torrents.