Fanotify and hierarchical storage management [LWN.net]

Fanotify and hierarchical storage management

By Jake Edge
May 23, 2023
LSFMM+BPF

In the filesystem track of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Amir Goldstein led a session on using fanotify for hierarchical storage management (HSM). Linux had some support for HSM in the XFS filesystem's implementation of the data management API (DMAPI), but that code was removed back in 2010. Goldstein has done some work on using fanotify for HSM features, but he has run into some problems with deadlocks that he wanted to discuss with attendees.

He began by pointing to a wiki page he created to describe HSM and his goals for using fanotify to support it. His employer is CTERA Networks, which builds "cloud gateway solutions", where files appear to be available on the local system even though they may be cached on a local network-attached storage (NAS) device or stored somewhere else in the cloud. The NAS might not have space to accommodate all of the data, but it functions as a (more) local cache.

[Amir Goldstein]

Windows has an API for HSM, so files have a status that reflects their location; users can decide if they want to access a file if, for example, it will require a lengthy copy from the cloud. This HSM support is based on "reparse points" in NTFS; when those are encountered, another filesystem driver is called to provide the file data. There is nothing like that in Linux, so those who provide that functionality have to implement their own scheme; CTERA uses FUSE.

The FUSE solution comes with various kinds of problems and he hopes that some of the alternatives being discussed at LSFMM+BPF will help alleviate them. DMAPI is an old API, which is insufficient for today's HSM needs, though the code from XFS still exists if there is anything useful in it; remnants of it are still present in Linux, as the "punch hole" interface was added for DMAPI. When the DMAPI hooks were removed, there was a comment suggesting that "at least the namespace events can be done much saner in the VFS", which is what Goldstein is trying to do now.

He showed that a simple HSM can be implemented using the existing upstream fanotify API. It could use sparse files to represent the data that is not local. It does so by first getting an exclusive write lease on the file using fcntl(fd, F_SETLEASE, F_WRLCK), migrating the content elsewhere, and then punching a hole in the file by passing FALLOC_FL_PUNCH_HOLE to fallocate(). The HSM service can subscribe to various types of fanotify events in order to be notified when the content, permissions, or directory entry of the file is changed; the cloud version can then be updated as needed. "It is very naive, but it works."
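The lease, migrate, and punch sequence can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Goldstein's implementation: evict_file() and its upload callback are hypothetical names, fallocate() is reached through ctypes on the assumption of a 64-bit Linux system, and the flag values are copied from <linux/falloc.h>. FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE, so the file keeps its apparent length while its blocks are released.

```python
import ctypes
import fcntl
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(None, use_errno=True)
# int fallocate(int fd, int mode, off_t offset, off_t len);
# off_t is assumed to be 64 bits wide here.
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def evict_file(path, upload):
    """Hypothetical helper: hand the content to upload(), then punch it out.

    Returns True if the local blocks were released, False if the lease or
    the hole punch was refused (for example, the file is open elsewhere or
    the filesystem lacks support).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Write lease: only granted when nobody else has the file open.
            fcntl.fcntl(fd, fcntl.F_SETLEASE, fcntl.F_WRLCK)
        except OSError:
            return False
        size = os.fstat(fd).st_size
        if size == 0:
            return False  # nothing to migrate
        # A real service would stream the data in chunks instead.
        upload(os.pread(fd, size, 0))
        mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
        if _libc.fallocate(fd, mode, 0, size) != 0:
            return False
        return True
    finally:
        os.close(fd)  # closing the descriptor also releases the lease
```

After a successful call the file reads back as zeros with its original size; the fanotify side of the service is what would notice a later access and bring the data back. A lease holder is also signaled (SIGIO by default) when another open conflicts with the lease, which a real service would need to handle.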

However, it is not practical for today's use. For example, users have to download their entire movie, say, before starting to watch it. He has a patch set to add some features to fanotify that would make it more usable as an HSM, which he posted (as a pointer to his Git tree) in an email back in September 2022. The resulting thread eventually led to the session at the summit.

The changes are small, he said, simply adding a few more fanotify event types (or additional information to existing events), which would facilitate the HSM use case. They are described further in a section of the wiki page and would allow features like populating directories on demand, streaming downloads of large files, and crash-safe change tracking. He has been working on change tracking for a number of years now in various guises; he has an internal solution, but would like to get something into the mainline.

He described a demo that he did not have time to actually perform, which can be seen in slide 6 of his slides; it was based on the HTTPDirFS FUSE filesystem, which allows read-only mounts of a directory accessed via HTTP. Goldstein modified it to use fanotify on a kernel with his patches. It would allow him to mount the kernel.org /pub directory locally, then access a file deep in the directory hierarchy. The filesystem lazily populates the needed directories into the local directory where it is mounted. The mount point is no longer a FUSE mount in that mode, but is a bind mount instead, with fanotify events being monitored. He displayed an example command that would display the first few lines of a tar table of contents of a large file. Only the first 1MB of the file would be transferred before the command completed, rather than waiting for the entire contents.
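Why such a command can finish after only part of the file has arrived follows from the tar format: each member's header is immediately followed by that member's data, so the first few entries can be listed from the start of the archive alone. The sketch below illustrates that with Python's tarfile module and an in-memory archive; it is not the demo from the slides, and CountingReader, build_archive(), and first_names() are hypothetical names made up for the illustration.

```python
import io
import itertools
import tarfile


class CountingReader:
    """Wraps archive bytes and counts how many the consumer actually reads."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive(n_files=1000, file_size=1000):
    """Build an uncompressed tar archive of many small files in memory."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w",
                      format=tarfile.USTAR_FORMAT) as tf:
        for i in range(n_files):
            info = tarfile.TarInfo(name=f"file{i:04d}")
            info.size = file_size
            tf.addfile(info, io.BytesIO(b"x" * file_size))
    return out.getvalue()


def first_names(reader, count):
    """List the first `count` member names, reading the archive as a stream."""
    with tarfile.open(fileobj=reader, mode="r|") as tf:
        return [m.name for m in itertools.islice(tf, count)]
```

Calling first_names(CountingReader(build_archive()), 3) returns the first three names while the reader's bytes_read counter stays at a small fraction of the archive size. With on-demand filling of file ranges, only that leading portion would need to be fetched.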

He had two more slides after the "demo" slide, which were increasingly complex, he said. They were an attempt to explain some problems that he has found, "in order to try to sell the solution". At one time, there was a problem with the original fanotify API: an operation would cause a FAN_ACCESS_PERM event, which might require the fanotify service to access the file; that access resulted in a second (blocking) FAN_ACCESS_PERM event, which led to a user-space deadlock. That was solved by adding a special file descriptor that can be used by the service to perform actions without triggering another fanotify event.
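For readers unfamiliar with permission events, the following sketch shows roughly what a service does with one. It decodes the fixed-size struct fanotify_event_metadata records that are read from a fanotify descriptor and builds the struct fanotify_response that must be written back before the blocked operation can continue. The layouts and constants are taken from <linux/fanotify.h>; parse_events(), answer(), and the fill_file callback are hypothetical names, and obtaining the fanotify descriptor itself (which requires privileges) is left out.

```python
import struct

# struct fanotify_event_metadata:
#   u32 event_len, u8 vers, u8 reserved, u16 metadata_len,
#   u64 mask, s32 fd, s32 pid
EVENT_FMT = "=IBBHQii"
EVENT_SIZE = struct.calcsize(EVENT_FMT)  # 24 bytes

# struct fanotify_response: s32 fd, u32 response
RESPONSE_FMT = "=iI"

FAN_ACCESS_PERM = 0x00020000
FAN_ALLOW = 0x01
FAN_DENY = 0x02


def parse_events(buf):
    """Yield (mask, fd, pid) for each event in a buffer read from fanotify."""
    offset = 0
    while offset + EVENT_SIZE <= len(buf):
        event_len, _vers, _reserved, _meta_len, mask, fd, pid = \
            struct.unpack_from(EVENT_FMT, buf, offset)
        if event_len < EVENT_SIZE:
            break  # malformed record; stop rather than loop forever
        yield mask, fd, pid
        offset += event_len


def answer(mask, fd, fill_file):
    """Return the bytes to write back for a permission event, else None.

    fill_file(fd) is a hypothetical callback that restores the content by
    using the descriptor delivered with the event; accesses made through
    that descriptor do not generate further fanotify events.
    """
    if not mask & FAN_ACCESS_PERM:
        return None  # plain notification, no reply expected
    verdict = FAN_ALLOW if fill_file(fd) else FAN_DENY
    return struct.pack(RESPONSE_FMT, fd, verdict)
```

The process that triggered the event stays blocked until the response is written, which is why any work done by the service while handling it must not itself wait on another permission event.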

But now there is another deadlock that can happen with the existing API; it is perhaps rare, but it can happen and he is surprised that it has not been reported. It involves a clone file range operation, which takes the superblock freeze lock, but it may cause the HSM (or other fanotify-based) service to also need to freeze-lock the superblock. If the files are on the same filesystem (thus share the same superblock), a deadlock will result.

This deadlock is perhaps more common in his HSM service than in other types of fanotify-based scanners (e.g. virus scanners). He has solved it by using a new event flag (FAN_PRE_VFS) that gets added to FAN_ACCESS_PERM events if the freeze lock has not been taken. He then went through the code and set that flag wherever that condition holds, which involved calling the notify hook in some new places. That gives the service an opportunity to fill the file before the clone file range operation freezes the superblock. That was his solution, which was not hard to do, Goldstein said.

He moved on to the second, even more complicated slide, which covered a similar kind of deadlock that could also result in a race condition that would cause his HSM to miss filesystem changes at times. The scenario was well beyond my ability to follow, but a video of the session should be available before long. His solution to the problem, which was suggested by Jan Kara, was to use sleepable RCU, which would avoid the race at the cost of an occasional false-positive change notification.

Once attendees seemed to get up to speed on the problem (and proposed solution), the session ran out of time, though discussion spilled over into the next slot. Josef Bacik said that he did not hate the solution that had been chosen, though he did not love it either. Kara explained why sleepable RCU was chosen, and Goldstein thought that the general idea could be applied to other filesystem-related ordering problems (such as when an inode's i_version field gets incremented).


Index entries for this article
Kernel: fanotify
Kernel: Storage management
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



Fanotify and hierarchical storage management

Posted Jun 23, 2023 15:28 UTC (Fri) by psusi (guest, #95157)

"He displayed an example command that would display the first few lines of a tar table of contents of a large file. Only the first 1MB of the file would be transferred before the command completed, rather than waiting for the entire contents."

That isn't how tar works. It does not have a table of contents at the start of the file. zip and dar do, but with tar, every file metadata record is immediately followed by its data. To list the files in the tar, the entire tar file must be read.

Fanotify and hierarchical storage management

Posted Jun 23, 2023 16:15 UTC (Fri) by jake (editor, #205)

> To list the files in the tar, the entire tar file must be read.

I don't know anything about the tar format, but the "demo" was to show the first few entries in the tar file. Given what you said about the format of tar, that would seem plausible from reading the first MB of the tar file (if the first few entries were contained in that chunk of the file).

jake

Fanotify and hierarchical storage management

Posted Jun 23, 2023 16:55 UTC (Fri) by farnz (subscriber, #17727)

The command didn't list all the files in the tar, though, just the first few. If I have thousands of small files in a tar, each under 4 KiB in size, then reading the first 1 MiB of the file is enough to list the first 256 files; if my command is `tar tf file.tar | head -n 3`, then I'm not going to read much more than 1 MiB before tar gets SIGPIPE from head, and shuts down.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds