
Leading items

Welcome to the LWN.net Weekly Edition for June 2, 2022

This edition contains the following feature content:

  • splice() and the ghost of set_fs(): the removal of set_fs() has left a trail of splice() regressions behind.
  • 5.19 Merge window, part 1: the first set of changes pulled for the next kernel release.
  • ID-mapped mounts: remapping file ownership on a per-mount basis.
  • Filesystems, testing, and stable trees: getting filesystem fixes into the stable kernels, and testing the result.
  • Challenges with fstests and blktests: dealing with non-deterministic filesystem and block-layer tests.
  • Adding an in-kernel TLS handshake: network storage and filesystems want TLS, but the handshake is a problem.
  • The Clever Audio Plugin: a new, MIT-licensed API for audio and MIDI plugins.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

splice() and the ghost of set_fs()

By Jonathan Corbet
May 26, 2022
The normal rule of kernel development is that the creation of user-space regressions is not allowed; a patch that breaks a previously working application must be either fixed or reverted. There are exceptions, though, including a 5.10 patch that has been turning up regressions ever since. The story that emerges here shows what can happen when the goals of stability, avoiding security problems, and code cleanup run into conflict.

The set_fs() function was added to the kernel early in its history; it was not in the initial 0.01 release, but was added before the 0.10 release in late 1991. Normally, kernel code that is intended to access user-space memory will generate an error if it attempts to access kernel space instead; this restriction prevents, for example, attempts by an attacker to access kernel memory via system calls. A call to set_fs(KERNEL_DS) can be used to lift the restriction when the need arises; a common use case for set_fs() is to be able to perform file I/O from within the kernel. Calling set_fs(USER_DS) puts the restriction back.
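To make that concrete, here is a minimal sketch of the classic pattern as it looked before the removal work; it is illustrative only (read_config_data() is a hypothetical helper, not code from the kernel, and modern kernels would use kernel_read() instead):

    /* Pre-5.10 style: temporarily lift the user-space limit to do file I/O. */
    #include <linux/fs.h>
    #include <linux/uaccess.h>

    static ssize_t read_config_data(struct file *filp, char *kbuf,
                                    size_t len, loff_t *pos)
    {
        mm_segment_t old_fs = get_fs();    /* remember the current limit */
        ssize_t ret;

        set_fs(KERNEL_DS);                 /* "user" accesses may now reach kernel memory */
        ret = vfs_read(filp, (char __user *)kbuf, len, pos);
        set_fs(old_fs);                    /* restore the protection */

        return ret;
    }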

The problem with set_fs() is that it turns out to be easy to forget the second set_fs() call to restore the protection of kernel space, leading directly to the "total compromise" scenario that kernel developers will normally take some pains to avoid. Numerous such bugs have been fixed over the years, but it had long been clear that the real solution was to just get rid of set_fs() entirely and adopt safer ways of accessing kernel memory when needed.

Developers (and Christoph Hellwig in particular) got more serious about this objective in 2020 and made a determined push to eliminate set_fs() entirely. Much of this work went into 5.10, though the final bits of the set_fs() infrastructure were only removed in 5.18. Back in 2020, though, one question that provoked some discussion was what should be done about splice().

The splice() system call will connect an open file descriptor to a pipe, then move data between the two for as long as the data stream lasts. This movement happens entirely within the kernel, potentially eliminating the need for large numbers of system calls; in some settings, it can provide a significant performance improvement. By its nature, splice() often has to move data to and from buffers that are in kernel space; to make that possible, it used set_fs().
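For readers unfamiliar with the system call, a minimal user-space sketch (not from the article) shows the idea: a file's contents reach standard output by way of a pipe without ever being copied into a user-space buffer:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int pipefd[2];
        int fd;
        ssize_t n;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || pipe(pipefd) < 0)
            return 1;

        /* file -> pipe, then pipe -> stdout, entirely inside the kernel */
        while ((n = splice(fd, NULL, pipefd[1], NULL, 65536, SPLICE_F_MOVE)) > 0) {
            while (n > 0) {
                ssize_t m = splice(pipefd[0], NULL, STDOUT_FILENO, NULL,
                                   n, SPLICE_F_MOVE);
                if (m <= 0)
                    return 1;
                n -= m;
            }
        }
        return n < 0;
    }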

Hellwig duly came up with a new implementation that would keep splice() working in the absence of set_fs(), but Linus Torvalds rejected it, saying that he didn't like the "complexity and messiness" of the implementation. But he also made it clear that he didn't feel the need to guarantee that splice() would keep working at all; he felt that making splice() work by default on most file types led to a number of security issues. Later in 2020, for example, he said:

I'd rather limit splice (and kernel_read too, for that matter) as much as possible. It was a mistake originally to allow it everywhere, and it's come back to bite us.

So I'd rather have people notice these odd corner cases and get them fixed one by one than just say "anything goes".

So the patches that went into 5.10 ended up breaking splice() for any file type that did not have explicit support for the new way of doing things; the idea was that the important cases would be noticed and fixed over time. That has indeed happened; if one looks for patches committed as explicit fixes to the disabling of splice() support, one finds fixes for the AFS filesystem, the 9p filesystem, the orangefs filesystem, /proc/mountinfo, the TTY subsystem, kernfs, sendfile(), the nilfs2 filesystem, and the JFFS2 filesystem.

Most recently, Jens Axboe reported that splice() no longer worked on /dev/random or /dev/urandom; he included a patch to fix the problem as well. These patches were later reworked by random-number-generator maintainer Jason Donenfeld and were applied to the mainline during the 5.19 merge window. Along the way, Donenfeld observed that the necessary changes resulted in a performance regression of about 3% when reading from /dev/urandom. That led him to ask whether the fix was something that was needed at all; after some discussion, Axboe gave him the lecture on regressions:

If you have an application that is written using eg splice from /dev/urandom, then it should indeed be safe to expect that it will indeed continue working. If we have one core tenet in the kernel it's that you should ALWAYS be able to upgrade your kernel and not have any breakage in terms of userspace ABI. Obviously that can happen sometimes, but I think this one is exactly the poster child of breakage that should NOT happen. We took away a feature that someone depended on.

That is the sort of breakage that did indeed happen but, in this case, a change was made knowing that this kind of problem would result. Hellwig said in response to Axboe's patch set that "compared to my initial fears the fallout actually isn't that bad", but a perusal of the above list of fixes might lead one to a different conclusion.

The removal of set_fs() is, in many ways, a model for what the kernel development process can do. A fundamental piece of low-level structure that had been deeply wired into the kernel since the beginning was replaced with a much safer alternative without breaking the project's pace of a stable release every nine or ten weeks. The steady stream of regressions resulting from this change, though, is not what the project sets out to do — and it seems certain that this particular gift has not yet stopped giving.

The decision to take this path was driven by a fear of security problems, based on the past history of the splice() system call. If those fears are still justified (and they might well be; consider, for example, that splice() was a part of the "Dirty Pipe" vulnerability reported earlier this year), then refusing to make all existing splice() implementations just work without set_fs() may have prevented far worse regressions than the ones we have seen. Having to fix a filesystem is annoying; having to endure yet another security drill for a branded vulnerability with a silly name is rather more so.

There is no way of knowing whether that is how things would have gone in this case. But it is true that this type of episode makes the kernel's "no regressions" rule look a bit more like just a guideline. It does not take too many of them to do damage to the project's reputation that is hard to splice back together.

Comments (11 posted)

5.19 Merge window, part 1

By Jonathan Corbet
May 27, 2022
As of this writing, just under 4,600 non-merge changesets have been pulled into the mainline repository for the 5.19 development cycle. The 5.19 merge window is clearly well underway. The changes pulled so far cover a number of areas, including the core kernel, architecture support, networking, security, and virtualization; read on for highlights from the first part of this merge window.

Interesting changes pulled into the mainline so far include:

Architecture-specific

  • A number of x86-specific boot options (nosep, nosmap, nosmep, noexec, and noclflush) have all been removed. Each of these disabled a CPU feature that it no longer makes sense to disable.
  • Support for the a.out executable format on x86, which was deprecated in the 5.1 release, has now been completely removed.
  • The x86 split-lock detection mechanism has been made a bit stronger; rather than just warning (by default) when a process uses split locks, the kernel will slow that process down considerably. That should preserve the performance of the rest of the system and, with luck, cause the offending application to be fixed.
  • The new Intel "in-field scan" mechanism can run diagnostics and detect CPU problems in deployed systems. This documentation commit has more information.
  • The xtensa architecture has gained support for a number of features, including SMP coprocessors, KCSAN, hibernation, and more.
  • The m68k architecture now implements a virtual machine based on the Android Goldfish emulator.
  • The Arm Scalable Matrix Extension is now supported (in host mode only, not for guest systems).

Core kernel

  • The io_uring subsystem has seen a number of enhancements. The new IORING_RECVSEND_POLL_FIRST flag will, when set for networking operations, cause an operation to go directly to polling rather than attempting a transfer first; this can save some overhead when the caller expects the operation to not be able to proceed immediately. There are some new flags to ease the management of fixed file descriptors. The "multi-shot" mode for accept() allows multiple connections to be accepted in a single operation. There are new operations to manipulate extended attributes on files. The socket() system call is now supported. Finally, there is also now support for "passthrough" operations that can send NVMe commands directly to the device.

All of these new API features are diligently undocumented; a rough, illustrative sketch of the new polling flag appears after this list.

  • It is now possible to store typed pointers in BPF maps; this merge commit has some more information. This feature should not be confused with "dynamic BPF pointers", which will also be in 5.19; this merge commit contains some information.
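As promised above, here is a rough sketch of the IORING_RECVSEND_POLL_FIRST flag in use. It is illustrative code written against liburing (queue_recv() is a hypothetical helper; a 5.19 kernel and a matching liburing are assumed):

    #include <liburing.h>

    static int queue_recv(struct io_uring *ring, int sockfd, void *buf, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        /*
         * Ask the kernel to poll for readiness rather than attempting the
         * receive immediately; for send and receive requests this flag is
         * passed in the SQE's ioprio field.
         */
        sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
        return io_uring_submit(ring);
    }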

Filesystems and block I/O

  • The EROFS read-only filesystem has been significantly reworked to use the fscache layer. This feature can, evidently, significantly improve performance on systems running a lot of containers from EROFS images. This merge message has a bit more information.
  • The EROFS work involved adding an "on-demand mode" to fscache, which is documented in this commit.

Hardware support

  • Hardware monitoring: Aquacomputer Octo temperature sensors and fan controllers, Aquacomputer Farbwerk 360 temperature sensors, Infineon XDPE152 voltage regulators, Microchip LAN9668 temperature sensors, and Nuvoton NCT6775F I2C interfaces.
  • Miscellaneous: Nvidia SN2201 platform switches, Silicon Mitus SM5703 voltage regulators, and MediaTek SPI NAND flash interfaces.
  • Networking: Marvell Octeon PCI Endpoint NICs, CTU CAN-FD IP cores (see the documentation), Analog Devices Industrial Ethernet T1L PHYs, pureLiFi LiFi wireless USB adapters, MediaTek PCIe 5G WWAN modem T7xx adapters, Texas Instruments DP83TD510 Ethernet 10Base-T1L PHYs, Sunplus Dual 10M/100M Ethernet adapters, and Realtek 8852CE PCI wireless network (Wi-Fi 6E) adapters.

    Also: a number of old networking drivers have been removed (commit, commit, commit, commit, commit, commit) as being unmaintained and, presumably, unused.

  • Additionally: the power-management subsystem has gained support for devices that operate on an "artificial" power scale. In short, this means such a device provides information about the relative efficiency of different power states, but that information is not tied to any real-world scale. This documentation commit contains a little more information.

Networking

  • The BIG TCP patch set has been merged; this work allows for the sending of huge IPv6/TCP packets on data-center networks.
  • The addition of packet-drop annotations continues, improving an administrator's visibility into why network packets are not making it through the system.
  • The multipath TCP (MPTCP) protocol can now fall back to regular TCP in some situations where the multipath features cannot be used.
  • There is also a new user-space API for the management of MPTCP flows. Documentation is scarce but there is an introduction in this merge commit.

Security-related

  • Various confidential-computing mechanisms allow secrets to be pushed into virtual machines without exposing them to the host system. The kernel's EFI subsystem can now expose those secrets to the guest via a directory (security/coco) under securityfs. The documentation in this commit and this commit gives some more information.
  • The kernel's lockdown mode will prevent even a privileged process from changing kernel memory outside of the kernel's control — or, at least, that is the intent. It turns out that lockdown is easily bypassed by simply firing up a kernel debugger. This fix, applied to the mainline (and certainly headed toward the stable updates), closes the hole.
  • There have been a number of improvements to the random-number generator to improve robustness and performance; this merge commit contains an overview.
  • The structure randomization hardening feature is now available with the Clang compiler as of version 15.
  • The Landlock security module now supports rules controlling the renaming of files.
  • The Integrity Measurement Architecture (IMA) can now use fs-verity file digests for verification.
  • The meaning of "unprivileged BPF" has changed somewhat. In current kernels, disabling unprivileged BPF makes all bpf() system-call commands unavailable. In 5.19, instead, unprivileged processes will have access to commands that do not actually create objects. That enables scenarios where a privileged process loads a BPF program, then allows an unprivileged process to interact with it. This merge commit has a little more information.

Virtualization and containers

  • Support for AMD's Secure Nested Paging feature has been added. In short, this feature will cause a virtual machine to be notified if its encrypted memory has been accessed outside of the machine. This mechanism can, among other things, thwart replay attacks.
  • Support has also been added for Intel's Trusted Domain Extensions (TDX) mechanism, which provides some similar features. See this documentation commit for some more information.

Internal kernel changes

The 5.19 merge window is just getting started; it can be expected to remain open through June 5. Once it closes, LWN will be back with a summary of what was pulled in the second half; stay tuned.

Comments (none posted)

ID-mapped mounts

By Jake Edge
May 30, 2022
LSFMM

The ID-mapped mounts feature was added to Linux in 5.12, but the general idea behind it goes back a fair bit further. There are a number of different situations where the user and group IDs for files on disk do not match the current human (or process) user of those files, so ID-mapped mounts provide a way to resolve that problem—without changing the files on disk. The developer of the feature, Christian Brauner, led a discussion at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) on ID-mapped mounts.

He began with an introduction. There are multiple use cases, but he likes to talk about portable home directories first because they are not related to containers, which many think is the sole reason for ID-mapped mounts. A portable home directory would be on some kind of removable media that can be attached to various systems, some of which have a different user and group ID for the user, but, of course, the media has fixed values for those IDs. ID-mapped mounts allow the device to be mounted on the system with the IDs remapped to those of the user on the local system.

[Christian Brauner]

Beyond that, of course, are various container use cases, such as sharing a root filesystem with multiple containers, each of which is using its own user namespace with a different mapping for UID 0. Each of the containers needs to be able to access the files as "root", but UID 0 inside the namespace is mapped to some nonzero UID on the host system; an ID-mapped mount would enable that nonzero ID to be mapped to UID 0 for filesystem access. Similarly, sharing data between a host filesystem and one in a user namespace may require remapping the IDs. Some of these cases were handled with expensive recursive chown calls before ID-mapped mounts came along.

There are some filesystems that can be used in user-namespace-based containers, most notably overlayfs, but there are still lots of limitations and the main filesystem types, Btrfs, XFS, and ext4, are not really able to be used in that manner. Once all of the use cases were gathered, he said, the most flexible solution turned out to be a per-mount mapping of UIDs and GIDs, which is what ID-mapped mounts provide.

The API for the feature uses the mount_setattr() system call, which allows changing the ID mappings as well as other attributes of mounts. Brauner clarified that the feature applies to all virtual filesystem (VFS) mounts, so bind mounts are included. Unlike mount(), mount_setattr() allows changing mount attributes recursively.

Using the feature requires passing a flag and a file descriptor to mount_setattr(); the file descriptor is that of a user namespace that does the ID mapping that should be applied to the mount. The implementation was done in the VFS layer, so individual filesystems "do not need to be really aware of it"; there are APIs available to make it easy on the filesystems, he said. Ted Ts'o asked about a command-line tool for doing an ID-mapped mount; Brauner said that one should be merged soon into util-linux.

Amir Goldstein noted that fstests already has a binary tool for testing these mounts. Brauner added that there are 15K lines of code in tests, already upstream in fstests, for ID-mapped mounts that aim to test the feature in all possible combinations. That includes things like access-control lists (ACLs), Linux capabilities, setuid and setgid execution, and so on. Every time a bug or regression is found, a new test is added to the suite.

He spent a bit of time demonstrating the tool and the feature, noting that the mapping works in both directions: IDs of files in the mount follow the mapping and files created within the mount have the reverse-mapped IDs outside of it. The feature is already being used by various tools, such as systemd-nspawn and systemd-homed; it has also been added to the runC container specification, so "there is lots of activity going on around this".

Currently, ext4, XFS, Btrfs, and several other filesystems support the feature; there is a patch set for overlayfs that is on-track to be merged soon. David Howells asked what filesystems need to do to support ID-mapped mounts. Brauner said that "in principle it is easy" to do so. Network filesystems may have some additional wrinkles, however; he has a patch set for Ceph but it still needs more work. The changes for ext4 and XFS were small, he said, and others are likely to be similar because most filesystems do not really use the IDs directly. The XFS quota-handling code does use the IDs, so it needed a bit more work. There is a long document available and he is willing to help add it to other filesystems.

Network filesystems need to determine which ID they want to send to the server, he said. Normally, the mapped ID is the right choice, but that may not be true for all cases.

Chuck Lever asked how the ID mapping could be changed for an existing mount and wondered if it could just be remounted to make that change. Brauner said that no changes are allowed once the namespace has been attached to the mount or the mount has been attached to the filesystem. Due to "lifetime issues" with regard to the use of the mapping, it is too complicated to allow changes once the filesystem has been fully mounted. Using the new mount API, a user will create a detached mount, then set the ID mapping on it, then, finally, attach it to the filesystem.
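A minimal sketch of that sequence might look like the following; it is illustrative only (the paths are invented, error handling is abbreviated, and a libc new enough to define these syscall numbers, along with the definitions in linux/mount.h, is assumed):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>

    static int attach_idmapped(int userns_fd)
    {
        struct mount_attr attr = {
            .attr_set  = MOUNT_ATTR_IDMAP,
            .userns_fd = userns_fd,   /* user namespace holding the ID mapping */
        };
        int fd_tree;

        /* 1: create a detached copy of the mount */
        fd_tree = syscall(SYS_open_tree, AT_FDCWD, "/srv/data",
                          OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
        if (fd_tree < 0)
            return -1;

        /* 2: set the ID mapping while the mount is still detached */
        if (syscall(SYS_mount_setattr, fd_tree, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0)
            return -1;

        /* 3: attach the mapped mount to the filesystem tree */
        return syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, "/mnt",
                       MOVE_MOUNT_F_EMPTY_PATH);
    }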

Lever also asked about the limits for the number of entries in the mapping; for example, in a system with thousands of users, where each user should be mapped to their own ID in a single mount. Brauner said that user namespaces were originally limited to five mappings, but he raised that limit to 340 in 2015 or 2016. It will be difficult to increase it beyond that, he said, because mapping is done in a hot path; he optimized the data structure for the mappings and increasing it further will have a performance impact.

Ts'o wondered if there was any thinking about supporting "project IDs", which are used by some container systems; those IDs are used for project-wide quotas in filesystems. Brauner said that project ID needs to be revisited, since "we have dodged this issue for years". The intended semantics are not clear, so he has been confused when looking into it.

While both XFS and ext4 support those IDs, Ts'o said he is confused by the semantics as well, at least with respect to user namespaces. He and Darrick Wong discussed it at one point and it was not clear whether both filesystems worked the same way, though there is an intention to unify their behavior. Brauner said that quota handling is not the same between different filesystems in Linux; each seems to have its own quirks. In the Zoom chat, Jan Kara pointed out that ID-mapping changes had not been made to the VFS quota code, at least yet; that was relayed as time expired on the session, however.

Comments (14 posted)

Filesystems, testing, and stable trees

By Jake Edge
May 31, 2022
LSFMM

In a filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Amir Goldstein led a discussion about the stable kernel trees. Those trees, and especially the long-term support (LTS) versions, are used as a basis for a variety of Linux-based products, but the kind of testing that is being done on them for filesystems is lacking. Part of the problem is that the tests target filesystem developers so they are not easily used by downstream consumers of the stable kernel trees.

His interest in the problem comes about because he is using the 5.10 LTS kernel and the XFS filesystem. He realized that XFS is not being maintained in that kernel; there are only three XFS patches backported to it in the past two years or more. There is some history behind that, though most in the room already know it, he said.

[Amir Goldstein]

He has been backporting XFS patches to 5.10 because there are more than just three bug fixes for XFS since that kernel was released. In something of a disclaimer, he said that it is his responsibility to do those backports; he is not suggesting that others should be doing that work. He has made some progress with the backports and has been doing some testing of them in conjunction with Luis Chamberlain. His intent in the session was to discuss the process for stable kernels and filesystems.

One reason that the stable kernels exist, Goldstein said, is to allow multiple organizations to collaborate and "not duplicate work". That only works if the LTS releases are used by the "big players", so the value of those releases drops if they are not widely used. Many distributions do not use the LTS kernels, but there are some organizations that do. Google Cloud, for one, is following the stable kernel releases, and he has heard that Microsoft is doing the same. Android is also following the stable releases, but that project has no interest in XFS.

The key to having stable kernels with stable filesystems is being able to run fstests (formerly xfstests) on them. That means collaborating on testing, the test suite, and the baselines of which tests are expected to pass and fail. Josef Bacik said that when he worked at Red Hat, one of the pain points was in running the most recent fstests on older kernels, as it would "blow up" in various ways, which was annoying. But running the latest fstests and seeing newer tests fail can also point to patches that you may want to backport "depending on how much pain you are willing to absorb", he said.

Goldstein said that fstests are mainly used to test the upstream kernels; when they are applied to LTS kernels "things happen" so it is not easy to do so. Fstests is not friendly to people trying to test LTS kernels, which is a different approach than that of another test framework that he works with, the Linux Test Project (LTP). That project has some practices that could be adopted by fstests; in particular, having a standard way to annotate regression tests, giving the commit that fixed the bug and what version of the kernel it is fixed in. That way, if the test fails on a different kernel, "you get a hint" that maybe a backport of that commit is needed or, perhaps, that the kernel under test will not support the feature being tested.
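As an illustration of that convention, an LTP test carries its annotations in the tst_test structure; the commit hash and CVE number below are placeholders, not real references:

    #include "tst_test.h"

    static void run(void)
    {
        /* ... exercise the original bug ... */
        tst_res(TPASS, "regression not reproduced");
    }

    static struct tst_test test = {
        .test_all = run,
        .tags = (const struct tst_tag []) {
            {"linux-git", "123456789abc"},   /* commit that fixed the bug */
            {"CVE", "2099-0000"},            /* placeholder identifier */
            {}
        },
    };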

LTP also has a simple script that can be run on a kernel branch to determine if it has the commits that appear in the annotations, or has backports of those commits that refer to the original commits. That will give you a list of the tests that should work; the list will be customized to that exact kernel branch, he said.

Ted Ts'o said that most filesystems are happy to allow the stable developers to choose fixes to incorporate; XFS is a notable exception to that. For ext4, the process works well, he said; every year or so there is a problematic ext4 patch that has to be reverted from the stable trees because it was not suitable for them. Normally, those kinds of patches are spotted during the stable review.

Ts'o and his team have been working on identifying XFS patches to apply to the 5.15 kernel, because that is a kernel of interest for Google, using the same scripts that Greg Kroah-Hartman and Sasha Levin use to identify candidate patches. It has taken longer to do this work than he had hoped, in part because of the time it has taken to get a baseline of which fstests should be passing so that they can detect failures caused by backports. They have been using an automated test system, with around ten different configurations based on input from XFS maintainer Darrick Wong.

It turns out that there were some fstests that only passed if they cherry-picked some of the "hundred-odd out-of-tree commits" that are in Wong's personal fstests tree, but have not yet gotten to the upstream repository. So, Ts'o now has his own fstests branch with the pieces from Wong that were needed.

It is his intent to report on the work that they have done to the XFS mailing list, including a list of the patches that they are proposing to add to 5.15. After that, there will need to be a negotiation about what is considered appropriate testing, Ts'o said, as well as a need to figure out how the XFS maintainers want to proceed. Whether the process will be to propose the fixes for stable and await any explicit nacks from the XFS folks, or whether the XFS maintainers will be explicitly choosing the set of patches to add to stable, is unclear at this point. That is a conversation that he hopes to have soon.

Chamberlain said that in the past, the XFS maintainers have agreed that he and Goldstein could review XFS patches for the stable kernels. But, as noted by Ts'o and others, establishing the baseline takes a lot of thankless work; it also requires fairly large systems, Chamberlain said. Right now, each developer is making their best effort at testing, but the community needs to collaborate more on the testing effort; the next LSFMM session would cover some of that, he said. Candidates for XFS fixes can be sent to him and Goldstein; they will queue the patches up for their testing, which will help give some confidence about whether the patches are good candidates or not.

Jan Kara came in over the Zoom link to say that the distributions, including SUSE where he works, do care about XFS fixes. The SUSE folks pick up XFS fixes and he thinks that Red Hat does the same thing. If those fixes do not end up in the stable kernel, they get backported to the enterprise kernels and then tested. The resources required to do all of that are fairly large. There is a need for developers with "at least a bit of a clue" to look at the patches to see if they make sense to be backported and then do that work if so. Then there is "quite a lot of testing", he said.

Goldstein talked about a tool that he created when he was looking at all of the XFS fixes from 5.10 to 5.17, which turned out to be around 600 patches. The tool uses the public-inbox mailing list archives to collect up all of the relevant patch series and, in particular, the cover letter. That made it much easier to see what dependencies there are and which patches to choose. It is "still human work", but the tool is a great assistant.

Ts'o noted that he does a round of testing of ext4 every three to four months using the latest LTS kernels. The resources required to actually run the test are modest; for a few dollars of Google Cloud time, he can run multiple configurations of fstests. The expensive part is the developer time to interpret the failures and to figure out if there is a patch that did not get automatically chosen but should have been.

Every time he does that round of tests, he finds one to three patches that he needs to manually backport and send to the stable developers. He is not sure whether other filesystem maintainers are doing similar testing, but it is valuable. That kind of testing is also not something that the maintainers themselves would need to do; it might be a good opportunity to add some newer developers to the filesystem community, he suggested.

There was some more discussion of what needs to be done to make it easier to run fstests on older kernels. Steve French wondered if there needed to be stable branches of fstests that could be kept in sync with the stable kernel releases. Goldstein said that annotations of commits and versions for fixes will be important to make it easier to use fstests on a wider variety of kernel versions.

Comments (7 posted)

Challenges with fstests and blktests

By Jake Edge
June 1, 2022
LSFMM

The challenges of testing filesystems and the block layer were the topic of a combined storage and filesystem session led by Luis Chamberlain at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). His goal is to reduce the amount of time it takes to test new features in those areas, but one of the problems that he has encountered is a lack of determinism in the test results. It is sometimes hard to distinguish problems in the kernel code from problems in the tests themselves.

He began with a request to always use the term "fstests" for the tests that have been known as "xfstests". The old name is confusing, especially for new kernel developers, because the test suite has long been used for testing more than just the XFS filesystem. It is not just new folks, though; even at previous LSFMMs, he has seen people get confused by the "xfs" in the name.

[Luis Chamberlain]

He noted that it takes ten years or more to stabilize a new filesystem; long ago he set an objective for himself to try to help with that problem. One of the ways to do so is to reduce the amount of time that it takes to run filesystem tests. Along the way, he decided to try to reduce the time it takes to test new features in the block layer as well.

One of the differences he has observed for fstests and blktests versus tests for the rest of the kernel is that their determinism is lower. The KUnit tests are "extremely deterministic", while the kernel selftests are highly deterministic, though they do sometimes fail unexpectedly. On the other hand, fstests and blktests are just the opposite; they can be "extremely non-deterministic", he said.

One of the takeaways from his findings is that the time spent on "testing" needs to be divided properly. There are four separate parts of that, which should all get roughly equal amounts of time: test design, tracking results, reporting bugs, and fixing low-hanging fruit. The kernel development community is "actually pretty good at test design", but does not really spend enough time on the other parts of the testing puzzle.

He has worked on the kdevops project to try to make some of that better. It uses Kconfig for its configuration and allows users to choose between cloud or local virtualization for bringing up systems for kernel testing. But that was not the topic of the session, he said, rather the topic is the lessons that he has learned from that effort.

One example of non-deterministic behavior is an ext4 test that fails once in 300 runs of fstests. When he asks filesystem developers how many times they run fstests in a loop, he gets a funny look, he said; but if you do run it in a loop, you will find some of these sporadic failures. Another example was a failure in blktests one time out of 80 because of an RCU stall. It turned out to be a problem in the QEMU zoned-device emulation, but that false positive in blktests helped track down the problem.

Another example was in the "block/000" and "block/009" tests in blktests, which would fail once out of 669 times. It took around eight months to track down the problem and reach a consensus on the fix. Jan Kara merged a fix for 5.12 that could potentially be backported to earlier stable kernels, but it would be difficult to do because the patches are complex.

Another failure that turned up in both blktests and fstests somewhat randomly is an example of the low-hanging fruit, he said. The error came about because of a longstanding problem removing kernel modules; the test tried (and sometimes failed) to remove the scsi_debug module. The underlying bug will be fixed in kmod soon by adding a more patient module remover, but it points to another problem: fstests and blktests should not require modules to be unloaded so that unrelated problems do not introduce sporadic failures of this sort.

But others in the room said that it was important to ensure that the cleanup was done correctly, for example with NVMe devices. There was some discussion of whether that kind of testing was truly useful and whether module unloading was the right way to go about it, but no real consensus emerged. Josef Bacik said that it was important to focus on "testing the thing that we care about" and not to let unrelated problems muddy the waters by way of side effects.

There are also some problems with the error reporting in fstests, Chamberlain said. There are two kinds of files associated with each test, a .bad file and another in the JUnit format, that do not always agree. So both types of files need to be processed in order to find the errors associated with a particular test. Blktests is better in that respect, he said, at least partly because it is a newer test suite.

Ted Ts'o said that tests with errors in one type of file and not the other are simply test bugs that should be fixed; the test runners could perhaps be changed to process both, as well, but the tests should be updated to have the right information. There was also some discussion of saving dmesg output when there are test failures. Bacik said that fstests has an option to always save that output even if the test passes, which can be useful; Omar Sandoval said that if blktests did not have a similar option, it would be added.

To try to investigate the failure rates of some of the tests, Chamberlain runs fstests and blktests in a loop for 100 iterations on each. For running fstests on all of the filesystems, that loop takes five or six days, while the loop takes a single day for blktests. Tests that do not pass for all of the test runs can be removed from the baseline while they are being investigated.

Ts'o cautioned that there are different goals for running these test suites, however. A QA person who is "trying for the platonic ideal of zero bugs" may have to do multiple runs looking for bugs that only appear infrequently. But, from his company's perspective, it does not make sense to try to detect those kinds of bugs since he does not have the budget to hire enough people to track them all down.

Instead, his testing focuses on running tests on the hardware that is being used in production to try to find the kinds of bugs that will occur in that scenario. So Ts'o said he has different goals than Chamberlain does, though the work that Chamberlain is doing is valuable. Ts'o said that he is trying to "maximize bang for the buck" to produce the highest-quality kernel he can afford given his budget. Chamberlain agreed that there is a need to prioritize the work based on the goals of the organizations involved, but as Ts'o noted, this kind of work requires lots of resources.

Moving on to another subject, establishing a baseline for a new filesystem takes one or two months, Chamberlain said. Not having a public baseline for a filesystem should be seen as technical debt within the community. But it takes time and resources to investigate the test failures, so dropping failing tests to establish a "lazy baseline" is needed.

Another problem he sees is that tests that are expected to fail for a given configuration or filesystem are not annotated as such; if they were, those tests could still be run and the failure verified. But others disagreed, saying that known failures should be turned into separate tests to demonstrate the correct behavior. Bacik worried that it would simply introduce further uncertainty into the tests. The session ran out of time, but Bacik scheduled another session later in the day to discuss other problem areas for testing.

Comments (5 posted)

Adding an in-kernel TLS handshake

By Jake Edge
June 1, 2022
LSFMM

Adding support for an in-kernel TLS handshake was the topic of a combined storage and filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). Chuck Lever and Hannes Reinecke led the discussion on ways to add that support; they are interested in order to provide TLS for network storage and filesystems. But there are likely other features, such as QUIC support, that could use an in-kernel TLS implementation.

Problem

Reinecke started things off by saying that, while Lever was interested in the feature for NFS, he wanted it for NVMe. The problem is that those applications cannot use the current in-kernel TLS support because they need to initiate the handshake from the kernel, Reinecke said. Current kernels can communicate using TLS, but the connection handshake is done in user space, then the connected socket is passed to the kernel for sending and receiving the data.
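For reference, here is a rough sketch of that existing flow from the user-space side; it is illustrative only (enable_ktls_tx() is a hypothetical helper, the key material would come from whichever TLS library performed the handshake, and reasonably recent kernel headers are assumed):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31      /* from the kernel's UAPI headers */
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282     /* from the kernel's include/linux/socket.h */
    #endif

    static int enable_ktls_tx(int sock, const unsigned char *key,
                              const unsigned char *iv, const unsigned char *salt,
                              const unsigned char *rec_seq)
    {
        struct tls12_crypto_info_aes_gcm_128 ci = {
            .info.version     = TLS_1_3_VERSION,
            .info.cipher_type = TLS_CIPHER_AES_GCM_128,
        };

        memcpy(ci.key,     key,     TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv,      iv,      TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt,    salt,    TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* attach the TLS upper-layer protocol, then install the send keys */
        if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }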

The reason the existing mechanism cannot be used is because there is already a socket connected to the remote host within the kernel that is, effectively, being converted to use TLS. So there is a need to pass a connected socket from the kernel to user space if the handshake will be done there, but there is no existing mechanism to do that.

[Chuck Lever and Hannes Reinecke]

An alternative would be to do the whole job within the kernel, as a company called Tempesta has done, Reinecke said. That works, but it brings "a lot of security-relevant code" into the kernel, which would require an audit to help limit the potential security danger. Someone suggested writing that code in Rust; "we did think of that", Lever said with a chuckle. In any case, there are reasonable arguments that this kind of code should not be in the kernel at all, regardless of language, Reinecke said.

James Bottomley asked about using the kernel as a man in the middle and passing the packets back and forth to user space as needed. Reinecke said that does not work with the existing libraries; if the kernel endpoint can be passed to user space, there are TLS libraries that can just handle the handshake directly.

Steve French said that there is value in finding a way to create a guinea-pig implementation for dealing with the handshake as a starting point, even if that code never goes upstream. It would allow the creation of a reference platform that shows that TLS for NFS, NVMe, or, in his case, SMB over QUIC, is viable, then it can be reworked as needed. But there is no good example that he could find of an upcall passing the kernel socket to a user-space library.

Reinecke agreed; there is no mechanism of that sort, which is why they have been pondering on how it should be done. One possibility is to update the netlink mechanism to allow passing file descriptors from the kernel to user space. Josef Bacik said that the Linux network block device (NBD) already uses netlink that way, but Lever pointed out that user space creates the endpoint for NBD, not the kernel, so that is passing the socket in the opposite direction of what is needed here.

David Howells said that for TLS 1.3 all of the necessary code should already be available in the kernel crypto subsystem. It should just be a matter of calling it properly. But Reinecke said that the crypto layer does have what is needed for encrypting and decrypting the data, but it does not have necessary pieces for the initial handshake.

Bacik said that FreeBSD does the TLS handshake in user space and wondered how it did so. Lever said that it passed a file descriptor to a user agent that uses an existing library, probably OpenSSL, to do the handshake. That is generally how the security community recommends that it be handled.

On the server side, the kernel will be accepting connections from clients that will then need to have a TLS connection initialized, Lever said, so there is really no way of getting around the need to pass connected sockets to user space. His initial implementation used a separate address family for a user agent's socket; the user agent would accept a connection from the kernel on that socket, which "materializes the connected endpoint in the user agent's file descriptor table". That socket gets passed by the agent to GnuTLS, which does the handshake and closes the accepted socket; that tells the kernel that the connected endpoint is ready to use.

That prototype worked for NFS and NVMe. They are hoping to build infrastructure that QUIC can use, as well, since it uses the TLS 1.3 handshake protocol to establish connections.

Direction

There was quite a bit of pushback from the networking developers when they discussed doing the handshake directly from within the kernel, Lever said. Reinecke asked if it made sense to continue exploring that option or if the user-space solution was the best route. Bacik said that he is normally "extremely allergic" to putting that kind of code in the kernel, but since the crypto pieces are already there, it does not "seem like it's a big deal" to do so. Bottomley pointed out that it is just the primitives that are present in the kernel, however; TLS has "a huge amount of handshaking code" that is missing from the kernel.

Lever said that TLS 1.3 reduces the amount of code needed for handshaking by roughly half; both he and Reinecke only need support for 1.3. But Bottomley said that he had looked at the bug reports for OpenSSL, specifically regarding the 1.3 handshaking; the code size may be less, but there are still many bugs reported for it.

Chris Mason said that the TLS-for-storage developers were faced with "two different slogs" to choose from; one is to add the TLS handshake code to the kernel and the other is to figure out how to add the mechanism so that it can be done in user space. Both will be a lot of work, but the user-space solution will likely be better long-term. As security problems arise with TLS, for example, it will be easier to address them in user space. If it were him, Mason said, he would choose the user-space route.

Lever said that one area where they do not feel comfortable with the user-space solution is in handling a root filesystem or block device over TLS. The user agent process needs to be made special somehow so that the kernel can always rely on it being there if it needs to re-establish the TLS session—even when there is memory pressure, for example.

Another problem that Lever sees is how the kernel knows that it can trust the process it is talking to. The kernel is making an upcall, but how can it be sure that it is talking to what it expects? It is a more general problem that he does not think has been solved for other user-space helpers. Ted Ts'o said that it is the same problem faced by firmware and module loading within the kernel; the assertion is that /sbin/request_module is sane and a similar assertion could be made for the TLS user agent binary.

For a prototype and to work out any problems that may be encountered, it clearly makes sense to do the handshake in user space, Lever said. Every time he talks to a group of kernel developers, he feels like the chances of eventually moving that handling into the kernel dwindle. French suggested that, once there are consumers of the facility in the kernel, the networking developers may see that it makes sense to move that handling into the kernel. Reinecke agreed; it really is not a filesystem or storage topic, but something that the networking developers need to consider.

There are two big advantages that TLS brings, which makes it a "great value add for storage protocols", Lever said. It allows both servers and clients to authenticate the other end of the connection using X.509 certificates. It also provides in-transit encryption in a way that can be offloaded to specialized hardware. TLS is well-established in the industry, which makes it a good basis for an encryption feature.

The mechanism for passing the TLS information to the user agent is perhaps one of the more contentious pieces, Lever said. The prototype uses socket options for the new address family to pass the connection information. That allows the kernel to send certificate data, pre-shared keys, and other information specific to the TLS connection and handshake. It is seen as ugly by some of the reviewers of the prototype code, however.

The session wound down soon after that. It would seem that, at least for now, the same basic approach will be taken, though there are still multiple issues that need to be resolved.

Comments (21 posted)

The Clever Audio Plugin

May 30, 2022

This article was contributed by Alexandre Prokoudine

Our introduction to Linux audio and MIDI plugin APIs ended with a mention of the Clever Audio Plugin (CLAP) but did not get into the details. CLAP is an MIT-licensed API for developing audio and MIDI plugins that, its developers feel, has the potential to improve the audio-software situation on Linux. The time has now come to get to those details and look at the state of CLAP and where it is headed.

When CLAP resurfaced in late 2021 after years of radio silence, xkcd #927 references were a popular response in all discussions about it. But there are a number of serious questions to ask about this API as well. Does CLAP actually compete with the other audio APIs available on Linux, including VST3, LV2, and others? Is it a viable alternative? Does it address problems that developers have with other APIs?

The backstory

Alexandre Bique started working on CLAP in 2014. In a nutshell, he wasn't happy with the industry dominance of Steinberg's VST API, but he also wasn't happy with some LV2 design decisions. In 2015, Bique was hired by Bitwig as a software engineer, where he's been working ever since. That's pretty much when the project's activity ground to a halt for the first time. Bique then resumed development in early 2016, then stopped again in October 2016.

When Steinberg started aggressively retiring VST2, Urs Heckmann, the creator of the proprietary u-he synths, contacted Bique and told him that he liked the CLAP API's simplicity. He also asked if Bique wanted to finish it, to which Bique said "yes". So in April 2021, Bique resumed work on CLAP, and has been tirelessly hacking on it ever since. There have been multiple releases of the API in the last several months, for the first time since 2015.

Right now, v0.24.1 of the CLAP SDK is available (tagged in Git). The specification is close to being final. It might undergo some further revision, but Bique is mostly working on code examples now and expects to ship v1.0 soon.

What's different about CLAP

The reasoning behind the creation of CLAP is multifaceted, being at the same time ethical, legal, and technical. In a post at KVR, Heckmann claims that the main reason for CLAP's existence is its liberal licensing. He also mentions a strong governance motive:

As plug-in developers we always feel like our products are second-class citizens in the DAW [digital audio workstation] ecosystem, as if the plug-in standard forms a harness of what we can do and what we can't do. As someone expresses it so nicely, "as if the host puts its fingers into the plug-in and directs it". CLAP feels a lot more on equal level, which is already expressed by having a host object and a plug-in object. The host isn't a shapeless god entity, it's just another struct that we communicate with. [...]

Current formats are maintained by host developers with a conservative product philosophy. A very short conclusion here is that some of the people who support CLAP want to switch from the back seat to the steering wheel of what a host/plug-in environment could make possible. And I'm absolutely certain that the hosts and plug-ins which do the switch - not just adoption, but also implementation - will gain market share.

Other reasons listed by the CLAP team are technical: fast plugin scanning for hosts that need to update the list of available plugins, controlled multi-threading, metadata for plugin categorization, etc. All these features are also available in LV2, though.

CLAP also takes into consideration some criticism that LV2 received from developers over the years, and thus it has neither RDF 1.1 Turtle metadata (too lengthy to write by hand, requires build system enhancement when automated), nor versioned extensions (dealing with those is cumbersome). There are even more LV2-specific issues that CLAP does not appear to have, like heavy, under-documented APIs, design limitations that even the LV2 maintainer thinks call for a partial API deprecation, and more.

It's still difficult to make a case for or against CLAP as compared to LV2, because people arrive at new APIs with preconceptions and older APIs' baggage. I've definitely seen developers praising the ease of creating a CLAP plugin from scratch. And indeed, a single header file required to get started sounds like a good thing. Over to Heckmann:

Several JUCE based open source synthesizer plug-ins have been ported to CLAP, almost literally overnight (I have to check chat protocols, but I do think I went to bed with one ported and woke up to yet another ported).

On the other hand, Ardour's simplistic, built-in, LV2 reverb plugin is 462 lines of C code and maybe a hundred lines of Turtle metadata (that you can generate from the source). Which makes it nearly an overnight hack, if you know what you are doing.

One developer I talked to, Vadim Sadovnikov (an LSP developer), summarized CLAP as a "modern API that is nevertheless as easy to get started with as was/is VST2". He summarized some of the benefits of CLAP: its pure C interface, simple header files that you can easily drop into your project and start using, the ability to package multiple plugins into a single library file, and UTF-8 in everything.

Here is William Light's (of LHI audio) take on the subject, expressed in a private conversation:

VST3 is really the only cross-platform widely-supported option, and there remain outstanding technical issues with MIDI (particularly surrounding MIDI Polyphonic Expression). Steinberg has also shown themselves to be legally heavy-handed in a way that has made the plugin developer community uneasy. In my opinion, CLAP does fill a major need here – namely, a plugin API owned and managed by the community rather than a single company.

From a purely technical perspective, I find CLAP to be very thoughtfully architected. It has its idiosyncrasies, but so do all of the other plugin APIs, and, in my opinion, CLAP has fewer oddities than the other APIs, and they're less onerous.

It's important to note that Bique is strongly against this kind of comparison, though. In a private conversation, he stated:

I want CLAP to stand on its own. I don't claim that CLAP is better than VST3, AU, LV2 or anything else. I don't want to go in the direction of CLAP vs XY because I think it is toxic.

Concerns

Navigating discussions on this API and talking to involved parties feels a lot like walking on thin ice. There are three major groups of people I have met: developers who think that CLAP is the next big thing, those who think it's a case of a not-invented-here syndrome, and developers who are in the wait-and-see camp. All of them are extremely outspoken.

Having said that, I have to add that, surprisingly, there isn't much analysis of CLAP available yet. A commonly used technical point I've seen across a few discussions is that CLAP doesn't allow for digital signal processing (DSP) options separated from the user interface. But there is not much agreement between developers about the need for that either.

One place where this is useful right now is MOD's multi-effect boxes: the DSP code is executed inside the box and, while there are physical knobs and encoders to change parameters, vastly more control over the effects graph is available in the web-based editor that runs in a browser on your laptop. For the vast majority of plugin users, though, this is an edge case. Besides, should MOD units come to rely on CLAP rather than LV2, the use case could be covered by adding an extension to the specification.

In most cases, DSP/UI separation is a significant additional complication for both plugin and host developers. One developer I spoke to views the omission of DSP/UI separation in CLAP as a welcome simplification rather than a mistaken omission.

Another concern I've come across is that we are going to end up with Bitwig and u-he as dictators shaping CLAP as they please. However, the people behind CLAP hint at plans to introduce some sort of governance body that would make it impossible for one or two companies to have full control over the evolution of the API. Again, we'll have to wait and see what happens there.

But the biggest objection seems to be that the time spent on CLAP could be used to iron out bugs and problematic parts of LV2.

How to see CLAP in action

Today, if you want to see a real CLAP plugin working in a real host application like a digital audio workstation, you'll have to mix free and proprietary software.

Bitwig is currently the only digital audio workstation that has CLAP support in a released version, and it's proprietary. The feature is also hidden from users by default. This will change in the future; once the CLAP v1.0 release is available, Bitwig developers will enable its support by default. There's also some evidence that there is initial, currently unreleased support for CLAP plugins in Reaper, another proprietary digital audio workstation.

On the plugin side, u-he's MFM2 synth (proprietary) is available in CLAP as well but probably won't be publicly updated until CLAP 1.0 is out. As MFM2 and Bitwig rely on different versions of the CLAP API, running the CLAP version of the synth in Bitwig won't work right now.

That is, however, not a problem with free plugins. The Surge family of GPLv3 plugins has preliminary CLAP support: these include Surge XT synth (and the effects stack), Monique synth, and Shortcircuit XT sampler. Two GPLv3 plugins from Jatin Chowdhury (Analog Tape Model and Build-Your-Own-Distortion) also have initial CLAP support. Several more plugins are available as part of Robbert van der Helm's NIH-plug framework.

Potential for further adoption

There are several prerequisites for CLAP to become mainstream. In a world where tens of thousands of plugins coexist with only a few dozen mainstream host applications, the success of CLAP largely depends on how many host applications will support it. So far, it has mostly been developers of a few proprietary digital audio workstations who are actively interested in adding CLAP support.

Will free hosts support CLAP? So far, developers are noncommittal on the topic.

Paul Davis has reservations about the need for CLAP but he also doesn't mind support for it being added to Ardour. Robin Gareus, another Ardour developer, has a good relationship with Bique and, in fact, got him to add optional support for zero-copy processing to the API. So there is no reason why the Ardour team would take a stance against CLAP even though they are heavily invested in LV2 (having participated in core LV2 spec development and created several extensions).

In a private conversation, Alexandros Theodotou didn't object to having support for CLAP in Zrythm in the future either, but he thinks it's too early to bother working on that. Rui Nuno Capela, of Qtractor fame, is familiar with the latest news on CLAP but "has no plans to pursuit yet another plugin API at all" (from private conversation). One of the Meadowlark DAW developers expressed interest in adding CLAP support and actually added some preliminary code to the project's engine. That said, this program is at a very early development stage. It's going to take a while for anybody to benefit from either CLAP or LV2 support in it.

In a nutshell, free host developers are likely to accept patches adding CLAP support, but less likely to work on CLAP themselves. The amount of work involved will vary from one host to another; most hosts out there support multiple plugin APIs, so they have an abstraction layer, and thus adding support for another API is not too much of a burden. But CLAP supports advanced features like polyphonic modulation in synths; if you want these features, that might mean writing entirely new code that touches more parts of the host's code than plugin-related code.

One other way to help CLAP adoption is to wrap CLAP plugins for existing APIs so that plugin developers do not have to care about hosts' support for CLAP. The host application would see something like, for example, a VST3 plugin, but there would be a CLAP plugin inside. There is a new project called claptrap that aims to give developers convenient tools to do just that.

The other prerequisite is support for CLAP in frameworks, like JUCE (GPLv3 + proprietary) and iPlug2 (open source, custom license), where you write code once and generate binary plugins for multiple plugin APIs. Here is a rather sensible observation by user "audiojunkie" on KVR:

Developers want to write once and be able to compile for all architectures, OSes, etc. JUCE is the de-facto standard and overall king in this realm. Without JUCE buy-in, a new format will just muddy up the waters for development and be another thing that users will think they want developers to support, and developers will hate it for that reason. If this format doesn't solve more than the Steinberg licensing problem, it's not likely to help anyone.

Currently, this is solved by CLAP's own JUCE6 extensions. There is, however, the host end of the issue because multiple sequencers and DAWs use JUCE for their user interface. Kjetil Matheussen, for example, refuses to add support for APIs to Radium if these APIs are unsupported by JUCE.

On the iPlug2 side, there is a somewhat active "clap" branch. There is also a new framework targeted at Rust developers specifically, NIH-plug, with support for VST3 and CLAP as API targets.

Overall, the situation is not hopeless at all. On the other hand, despite the project's age, these are still the early days and we are looking at years of development ahead.

How to get started

The project's home page now lists most relevant links for developers: the SDK, example host and plugin implementations, various bindings and helper projects, etc. There is neither a mailing list nor a forum for discussing CLAP yet; the conversation is currently scattered across multiple communication channels. The Surge team has a "clap-chatter" chat on its Discord server and, as far as I can tell from lurking there for the past few months, it seems welcoming to non-Surge developers.

In conclusion

CLAP is a case of a surprising rebirth fostered mainly by developers of proprietary software. The main reason it's on anyone's radar is because it's backed by u-he and, apparently, has support from Bitwig and Reaper — all of these companies and software projects are well known in the overall music industry. So that gets people's attention.

From the technical standpoint, there is a huge overlap between VST3, LV2, and CLAP. However, with regard to VST3, there are also ethical and business concerns at play. In the case of LV2, the disagreement appears to boil down to LV2's difficult learning curve, as well as how it is run as a project, how its gatekeepers respond to the concerns of plugin developers, and so on.

Watching CLAP unfold is going to be interesting or, in the worst-case scenario, educational. It is not a particularly good sign of the industry's health that, almost 20 years after work on the GMPI plugin standard began, developers still want to displace Steinberg with a truly open-source, community-driven plugin API. LV2 was not the success that had been hoped for, in this regard at least. Whether CLAP will fare better remains to be seen.

Comments (none posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds