Adding an in-kernel TLS handshake

By Jake Edge
June 1, 2022

Adding support for an in-kernel TLS handshake was the topic of a combined storage and filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). Chuck Lever and Hannes Reinecke led the discussion on ways to add that support; their interest is in providing TLS for network storage and filesystems. But there are likely other features, such as QUIC support, that could use an in-kernel TLS implementation.

Problem

Reinecke started things off by saying that, while Lever was interested in the feature for NFS, he wanted it for NVMe. The problem is that those applications cannot use the current in-kernel TLS support because they need to initiate the handshake from the kernel, Reinecke said. Current kernels can communicate using TLS, but the connection handshake is done in user space, then the connected socket is passed to the kernel for sending and receiving the data.
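The existing flow can be sketched from the user-space side. This is a minimal sketch, assuming Linux: the constant `TCP_ULP` (31 in `linux/tcp.h`) is not exported by Python's `socket` module, so it is defined by hand here, and `attach_kernel_tls` is a hypothetical helper name. Once a user-space library has completed the handshake, the application attaches the "tls" upper-layer protocol to the connected socket; installing the negotiated keys is a further step that is omitted.

```python
import socket

# Value of TCP_ULP from linux/tcp.h; Python's socket module does not export it.
TCP_ULP = 31


def attach_kernel_tls(sock: socket.socket) -> bool:
    """Try to attach the kernel "tls" ULP to an already-connected TCP socket.

    Returns True on success and False if the kernel refuses (for example,
    the socket is not connected, or the tls module is unavailable).
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_ULP, b"tls")
    except OSError:
        return False
    return True
```

In the real flow, a successful call is followed by `setsockopt()` calls at the `SOL_TLS` level that hand the session keys to the kernel; only then does the kernel encrypt and decrypt records on that socket.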

The existing mechanism cannot be used because there is already a socket connected to the remote host within the kernel that is, effectively, being converted to use TLS. So there is a need to pass a connected socket from the kernel to user space if the handshake will be done there, but there is no existing mechanism to do that.

An alternative would be to do the whole job within the kernel, as a company called Tempesta has done, Reinecke said. That works, but it brings "a lot of security-relevant code" into the kernel, which would require an audit to help limit the potential security danger. Someone suggested writing that code in Rust; "we did think of that", Lever said with a chuckle. In any case, there are reasonable arguments that this kind of code should not be in the kernel at all, regardless of language, Reinecke said.

James Bottomley asked about using the kernel as a man in the middle and passing the packets back and forth to user space as needed. Reinecke said that does not work with the existing libraries; if the kernel endpoint can be passed to user space, there are TLS libraries that can just handle the handshake directly.

Steve French said that there is value in finding a way to create a guinea-pig implementation for dealing with the handshake as a starting point, even if that code never goes upstream. It would allow the creation of a reference platform that shows that TLS for NFS, NVMe, or, in his case, SMB over QUIC, is viable, then it can be reworked as needed. But there is no good example that he could find of an upcall passing the kernel socket to a user-space library.

Reinecke agreed; there is no mechanism of that sort, which is why they have been pondering how it should be done. One possibility is to update the netlink mechanism to allow passing file descriptors from the kernel to user space. Josef Bacik said that the Linux network block device (NBD) already uses netlink that way, but Lever pointed out that user space creates the endpoint for NBD, not the kernel, so that is passing the socket in the opposite direction of what is needed here.
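For comparison, descriptor passing between two user-space processes is well established: a Unix-domain socket can carry open file descriptors as SCM_RIGHTS ancillary data. The sketch below shows only that user-to-user case, assuming Python 3.9 or later on a Unix system; the function names are hypothetical, and a pipe stands in for the connected socket. It says nothing about how the kernel-to-user direction discussed in the session would work.

```python
import os
import socket


def send_descriptor(channel: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a Unix-domain socket (SCM_RIGHTS)."""
    socket.send_fds(channel, [b"fd"], [fd])


def receive_descriptor(channel: socket.socket) -> int:
    """Receive one file descriptor; the caller owns the returned descriptor."""
    _data, fds, _flags, _addr = socket.recv_fds(channel, 16, 1)
    return fds[0]


def demo() -> bytes:
    """Pass the write end of a pipe across a socketpair and write through it."""
    owner, agent = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_descriptor(owner, write_end)
        received = receive_descriptor(agent)
        # The received descriptor refers to the same pipe as write_end.
        os.write(received, b"hello")
        os.close(received)
        return os.read(read_end, 5)
    finally:
        os.close(read_end)
        os.close(write_end)
        owner.close()
        agent.close()
```

The receiving side gets a new descriptor number that refers to the same open file, which is the property a handshake agent would need in order to drive a connection it did not create.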

David Howells said that for TLS 1.3 all of the necessary code should already be available in the kernel crypto subsystem. It should just be a matter of calling it properly. Reinecke said that the crypto layer has what is needed for encrypting and decrypting the data, but it does not have the necessary pieces for the initial handshake.

Bacik said that FreeBSD does the TLS handshake in user space and wondered how it did so. Lever said that it passed a file descriptor to a user agent that uses an existing library, probably OpenSSL, to do the handshake. That is generally how the security community recommends that it be handled.

On the server side, the kernel will be accepting connections from clients that will then need to have a TLS connection initialized, Lever said, so there is really no way of getting around the need to pass connected sockets to user space. His initial implementation used a separate address family for a user agent's socket; the user agent would accept a connection from the kernel on that socket, which "materializes the connected endpoint in the user agent's file descriptor table". That socket gets passed by the agent to GnuTLS, which does the handshake and closes the accepted socket; that tells the kernel that the connected endpoint is ready to use.

That prototype worked for NFS and NVMe. They are hoping to build infrastructure that QUIC can use, as well, since it uses the TLS 1.3 handshake protocol to establish connections.

Direction

There was quite a bit of pushback from the networking developers when they discussed doing the handshake directly from within the kernel, Lever said. Reinecke asked if it made sense to continue exploring that option or if the user-space solution was the best route. Bacik said that he is normally "extremely allergic" to putting that kind of code in the kernel, but since the crypto pieces are already there, it does not "seem like it's a big deal" to do so. Bottomley pointed out that it is just the primitives that are present in the kernel, however; TLS has "a huge amount of handshaking code" that is missing from the kernel.

Lever said that TLS 1.3 reduces the amount of code needed for handshaking by roughly half; both he and Reinecke only need support for 1.3. But Bottomley said that he had looked at the bug reports for OpenSSL, specifically regarding the 1.3 handshaking; the code size may be less, but there are still many bugs reported for it.

Chris Mason said that the TLS-for-storage developers were faced with "two different slogs" to choose from; one is to add the TLS handshake code to the kernel and the other is to figure out how to add the mechanism so that it can be done in user space. Both will be a lot of work, but the user-space solution will likely be better long-term. As security problems arise with TLS, for example, it will be easier to address them in user space. If it were him, Mason said, he would choose the user-space route.

Lever said that one area where they do not feel comfortable with the user-space solution is in handling a root filesystem or block device over TLS. The user agent process needs to be made special somehow so that the kernel can always rely on it being there if it needs to re-establish the TLS session—even when there is memory pressure, for example.

Another problem that Lever sees is how the kernel knows that it can trust the process it is talking to. The kernel is making an upcall, but how can it be sure that it is talking to what it expects? It is a more general problem that he does not think has been solved for other user-space helpers. Ted Ts'o said that it is the same problem faced by firmware and module loading within the kernel; the assertion is that /sbin/request_module is sane and a similar assertion could be made for the TLS user agent binary.

For a prototype and to work out any problems that may be encountered, it clearly makes sense to do the handshake in user space, Lever said. Every time he talks to a group of kernel developers, he feels like the chances of eventually moving that handling into the kernel dwindle. French suggested that, once there are consumers of the facility in the kernel, the networking developers may see that it makes sense to move that handling into the kernel. Reinecke agreed; it really is not a filesystem or storage topic, but something that the networking developers need to consider.

There are two big advantages that TLS brings, which make it a "great value add for storage protocols", Lever said. It allows both servers and clients to authenticate the other end of the connection using X.509 certificates. It also provides in-transit encryption in a way that can be offloaded to specialized hardware. TLS is well-established in the industry, which makes it a good basis for an encryption feature.

The mechanism for passing the TLS information to the user agent is perhaps one of the more contentious pieces, Lever said. The prototype uses socket options for the new address family to pass the connection information. That allows the kernel to send certificate data, pre-shared keys, and other information specific to the TLS connection and handshake. It is seen as ugly by some of the reviewers of the prototype code, however.

The session wound down soon after that. It would seem that, at least for now, the same basic approach will be taken, though there are still multiple issues that need to be resolved.


Adding an in-kernel TLS handshake

Posted Jun 1, 2022 23:13 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

So....

> socket(AF_INET, SOCK_STREAM, SOCK_TLS);

Adding an in-kernel TLS handshake

Posted Jun 2, 2022 10:29 UTC (Thu) by Lumag (subscriber, #22579) [Link] (10 responses)

I think that the TLS handshake should stay out of the kernel. It has lots of ugly corner cases, strange checks, etc. I'd prefer to have an OpenSSL/GnuTLS userspace helper doing the job.

Adding an in-kernel TLS handshake

Posted Jun 2, 2022 12:13 UTC (Thu) by jlayton (subscriber, #31672) [Link] (6 responses)

That is simpler, but when memory pressure rears its ugly head this may turn out to be problematic. Suppose we're in a situation where we have no free memory for allocations and a bunch of dirty NFS pages that need to be cleaned at the same time the socket connection goes down.

Now we're in a situation where we may not be able to allocate memory for the userland helper to do the handshake without performing writeback, but we can't perform writeback until we can allocate the memory. Deadlock. There are several variations on this theme as well. You can try to do things like mlock all of the userland helper's memory, but that's probably impossible if we're going to rely on 3rd party libraries for the TLS implementation.

For the initial implementation, they're sort of ignoring this for now, but it could turn out to be very problematic down the road.

Adding an in-kernel TLS handshake

Posted Jun 2, 2022 14:23 UTC (Thu) by james (subscriber, #1325) [Link] (5 responses)

Surely if the connection goes down, either it's going to come right back up again, in which case TLS 1.3 session resumption is a thing which could reasonably be in the kernel, or the client is going to have to live with those dirty pages for a while?

And if the server gets into a position where it can't resume the session (so you need userspace to make a new connection), and the kernel simply can't free enough memory to do that, then you're pretty much out of memory anyway? At some point, if you want reliability, you need enough memory to make that possible.

Adding an in-kernel TLS handshake

Posted Jun 2, 2022 18:50 UTC (Thu) by jlayton (subscriber, #31672) [Link] (4 responses)

Yes, that's all true. Also, the kernel is just overall better at avoiding these situations these days. It's more proactive about flushing and blocking new pages from being dirtied when things aren't being cleaned.

I agree that a userland implementation is definitely the way to go. We may need the daemon to be extra careful to avoid allocations in critical codepaths, which may be difficult depending on what the TLS libraries do under the hood.

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 0:54 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

Strictly speaking, can't the kernel mark the writes as bad even after it has accepted them, and return EIO on close/fsync? That's probably not very *nice*, but if the writes physically cannot be persisted anyway, you may as well let the application know that its data got lost.

But OTOH neither the man pages nor POSIX are very clear about what EIO even means or how userspace should react to it, so I imagine there are some applications that will freak out and do weird things if you return that error. Amazingly, POSIX does not even tell you what the state of the file descriptor is after close(2) fails with EIO, which means you have no way of knowing (assuming a POSIX-only environment that lacks /proc/self/fd) whether the file descriptor still exists and still needs to be closed! I guess the only safe way is to loop and repeatedly call close until you get EBADF? But that's obviously not thread-safe, and I could imagine a brain-dead implementation that just keeps returning EIO and never deallocates the fd.

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 14:27 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> Strictly speaking, can't the kernel mark the writes as bad even after it has accepted them, and return EIO on close/fsync? That's probably not very *nice*, but if the writes physically cannot be persisted anyway, you may as well let the application know that its data got lost.

Have you *looked* at what happens to data once the write() call returns? The reality is that the kernel doesn't have a clue which application needs to be told, nor how to tell it.

It gets even worse once network/raid/luks/integrity/blahblah gets involved. As a simple example, let's say you're writing a file of one block to a ten-disk raid array. You need to read 40k from disk, recompute checksums, and write the whole lot back. If THAT goes wrong, how do you tell the application it just trashed some data that was written six months ago ... ?

Okay, that's a bit extreme, but once the application has launched the data on its journey to disk, it's very hard to work out some sane way to pass an error back up the unpredictable path the data has taken.

Cheers,
Wol

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 15:50 UTC (Fri) by jlayton (subscriber, #31672) [Link]

> Have you *looked* at what happens to data once the write() call returns? The reality is that the kernel doesn't have a clue which application needs to be told, nor how to tell it.

Not true, at least not on modern kernels. We track writeback errors in a better way now such that if we get one, it's reported exactly once to fsync/msync on every fd that was open at the time that the error was recorded. Ditto for syncfs(2).
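From the application side, that means the place to look for a writeback error is the return of fsync. A minimal sketch, with a hypothetical helper name, of a write path that lets such an error reach the caller:

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data and force it to storage, surfacing any writeback error.

    os.fsync() raises OSError (for example with errno EIO) if the kernel
    reports a writeback error for this file.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller that catches the `OSError` has to decide what to do with data it thought was written, which is the hard part.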

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 15:45 UTC (Fri) by jlayton (subscriber, #31672) [Link]

Writeback errors are an option, but not a good one. Most applications can't handle them gracefully, so this usually means that the program dies or something equally awful...and in this case, the problem _should_ be temporary. We really don't want to return writeback errors on fsync unless there really is no other option. As far as close(2) goes, we really ought not return writeback errors to it at all. The only "legitimate" error for close(2) is EBADF.

Adding an in-kernel TLS _1.3_ handshake

Posted Jun 2, 2022 12:48 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (2 responses)

> lots of ugly corner cases, strange checks, etc.

This stance, which was apparently also raised at the talk, needs some fleshing out if it's to persuade me. Let's see an actual concrete list of, say, a half dozen "ugly corner cases".

What's proposed is specifically just TLS 1.3. In the PSK case, clients say "Hi, I want to use this PSK" and servers say "Can do" and we're off to the races. Some other cases are also this easy. Fallback is lots more complexity, but this proposal only wants TLS 1.3.

As I understand it, this is not intended to drop into your web browser or web server as a replacement for its implementation of TLS, it's targeting situations where we're currently using a plaintext wire protocol to underpin some kernel primitive and it would be better to speak TLS instead.

If anything, getting OpenSSL or GnuTLS involved opens the door to additional complexity, because hey, we have this full blown implementation, if the remote device says it speaks TLS 1.0 we should just roll with it, shouldn't we ? With an in-kernel TLS 1.3 solution the "Let's just do TLS 1.0 even though now all our security guarantees are destroyed" change isn't a LGTM patch to some userspace program, it's going to the LKML where hopefully somebody will just say "No".

Adding an in-kernel TLS _1.3_ handshake

Posted Jun 4, 2022 12:55 UTC (Sat) by james (subscriber, #1325) [Link] (1 responses)

> Fallback is lots more complexity, but this proposal only wants TLS 1.3.

So what about when TLS 1.4 comes out? You'll need to adopt it, and then you'll need fallback to 1.3.

Also, will TLS versions be retired at some point even if some sites are still using them and will experience this as a regression? Probably best to get this sorted before anyone adopts it.

Adding an in-kernel TLS _1.3_ handshake

Posted Jun 7, 2022 0:02 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

> So what about when TLS 1.4 comes out? You'll need to adopt it, and then you'll need fallback to 1.3.

The former would be a change to this work, and version negotiation would be part of that change.

I'm guessing you're not very familiar with how TLS works. Fallback was a specific trick: when things went wrong, instead of just giving up (which is the secure choice), you start over with different assumptions about your peer. There are grave security problems with this trick, but it was necessary, say, ten years ago, especially because people love badly designed middle boxes as "security". We've hopefully made so much of the handshake encrypted that the worst of that won't happen again. Also, this kernel work isn't general web security; if we're talking to a device that's wired to the same 10G switch or whatever, then hopefully the middle boxes aren't in the way.

Libraries like OpenSSL often support fallback, but a modern web browser no longer does (they ripped this out when they shipped TLS 1.2 as minimum version if they hadn't earlier) and there's no reason the kernel would either.

Normal version negotiation isn't thorny. "Hi, I can do X, Y, or Z" "Cool, let's do Y then". I'm also doubtful that we'd see a TLS 1.4 in the foreseeable future anyway. So we're asking about a hypothetical event maybe decades in the future.
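As a toy illustration of that exchange (a sketch, not any library's actual negotiation code; the function name and the tuple representation of versions are assumptions), the server side picks the highest version both peers listed and aborts if there is none, with no retry under weaker assumptions:

```python
from typing import Iterable, Optional, Tuple

# A protocol version as (major, minor), e.g. (1, 3) for TLS 1.3.
Version = Tuple[int, int]


def negotiate_version(
    client_offers: Iterable[Version], server_supports: Iterable[Version]
) -> Optional[Version]:
    """Return the highest version both sides support, or None to abort."""
    common = set(client_offers) & set(server_supports)
    return max(common) if common else None
```

A 1.3-only endpoint would call this with a single-element list, so a peer that offers only older versions gets `None` and the connection is refused.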

Adding an in-kernel TLS handshake

Posted Jun 2, 2022 8:09 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

It really seems like a lot of the vague concern was about older TLS versions, but the proponents don't actually want to speak anything other than TLS 1.3.

Even for the OpenSSL bugs, all the pathways where OpenSSL TLS 1.3 support intersects with backward compatibility are irrelevant for the kernel. OpenSSL needs to cope with the case where we fall back correctly, but the kernel can immediately punt in all such cases.

However, I've been assuming they don't want to do certificate checking in the kernel, and there's a reference to certificates near the end so maybe I'm wrong. You definitely don't want to get into that in the kernel, it's a necessary complication for the Web but there's no reason the kernel needs such a broad high level policy that I can see. I think just PSKs (which needn't be associated with any certificates, we just "know" [from userspace] which PSKs to use) should be enough.

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 15:55 UTC (Fri) by dkg (subscriber, #55359) [Link] (3 responses)

endpoint authentication seems like the real sticky part of this. Who does your side of the TLS session think it is talking to? who must it *not* talk to? If we're talking about a TLS connection used by an application, these are choices that only userspace is in a position to have answers for.

If the handshake is only for a bidirectionally-authenticated TLS pipe on the basis of a pre-shared key (PSK) then putting everything in the kernel makes sense. The semantics there are "we've already established a shared secret with a peer, and we want to bootstrap that into a confidential, integrity-protected bidirectional network channel with that same peer". That's something you can either do successfully or not at all, with minimal configuration choices from the side of either the initiating or receiving endpoints, and a straightforward set of error conditions.
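To make the "bootstrap a shared secret" idea concrete, here is a sketch of only the first step of the TLS 1.3 key schedule (RFC 8446, section 7.1), where the PSK is fed through HKDF-Extract to produce the early secret. It assumes SHA-256 as the hash; the function names are hypothetical, and this is nowhere near a handshake implementation.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, input_key_material: bytes) -> bytes:
    """HKDF-Extract from RFC 5869 with SHA-256: HMAC keyed by the salt."""
    return hmac.new(salt, input_key_material, hashlib.sha256).digest()


def tls13_early_secret(psk: bytes) -> bytes:
    """First step of the TLS 1.3 key schedule: extract with an all-zero salt."""
    zero_salt = bytes(hashlib.sha256().digest_size)
    return hkdf_extract(zero_salt, psk)
```

Both peers compute the same value from the same PSK without any certificate or policy input, which is why the PSK case has so few configuration choices.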

Once you say "i want to use TLS to connect to foo.example.net on port 993, and the remote peer's certificate must be currently marked as valid for "foo.example.net", certified against Mozilla's X.509 root store, and my revocation-checking policy requires confirmation by either CRL or OCSP response within the last 72 hours; and ideally the connection would be made using the Encrypted Client Hello mechanism so that metadata about the name foo.example.net doesn't leak in the Server Name Indication extension" then there are way too many fiddly choices to expect the client to indicate to the kernel. And there are many different ways that it could go wrong, so error reporting is significantly more complex.

And that's just the client side. On the server side, arbitrary clients might connect, and the authorization/privileges for any incoming client might differ based on the types of client authentication provided, or the particular extensions used during the handshake. Reporting the relevant details of that part of the handshake back out of the kernel seems like a significantly complex (and possibly unstable) API surface that i'd be reluctant to adopt if i were maintaining any of this stuff.

Adding an in-kernel TLS handshake

Posted Jun 3, 2022 20:16 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

You can cover most of the cases by creating a keyring with trusted CA roots and supplying it as a socket option. Everything else (like custom certificate validation) can be done using the old mechanism of userspace handshake.

Adding an in-kernel TLS handshake

Posted Jun 4, 2022 10:49 UTC (Sat) by dkg (subscriber, #55359) [Link] (1 responses)

I agree that you could pass a certificate bundle of trusted CA roots to the kernel, but that wouldn't be sufficient for a standard client connection. You'd also have to pass the name that you want to validate for the peer, as another socket option. Otherwise, any server certificate would be acceptable. For modern protocols like HTTP/2, you probably also need to offer a mechanism to set/require certain ALPN choices. But with those three changes, the client could just accept whatever other connection parameters the kernel chooses.
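For a sense of what those three inputs look like in an existing user-space API, here is a sketch using Python's standard `ssl` module (the helper name and its parameters are assumptions made for illustration; a kernel interface would presumably look quite different):

```python
import ssl
from typing import Optional, Sequence


def make_client_context(
    alpn_protocols: Sequence[str], ca_file: Optional[str] = None
) -> ssl.SSLContext:
    """Build a client context from trust roots and ALPN choices.

    create_default_context() turns on certificate verification and
    hostname checking; with ca_file=None it loads the system trust store.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse anything older than TLS 1.3.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.set_alpn_protocols(list(alpn_protocols))
    return ctx
```

The third input, the name to validate, is supplied per connection rather than on the context, as in `ctx.wrap_socket(sock, server_hostname="foo.example.net")`.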

This depends of course on a plausible X.509 parser in the kernel, and X.509 path-finding code to map from the provided end-entity certificate through any provided intermediate CAs to one of the trusted roots. So the kernel will be dealing with and reasoning about a notoriously bizarre data format, with information provided from user space *and* from the remote network.

It would also mean that if you wanted anything else, you'd need to revert to a userspace handshake. In particular, any of the following would require a userspace handshake:

- any sort of revocation check
- denylists of known-invalid CAs
- cached intermediate CAs (to be more likely to accept 'transvalid' end entity certificates)
- guidance on metadata leakage minimization (e.g. ECH)
- any sort of policy details negotiated in the handshake
- "early data" (data sent in the first flight based on assumptions about the TLS peer)

And you probably couldn't get significantly more information out of the kernel about the authenticated peer. Perhaps you could offer a getsockopt mechanism after the socket was connected that yields the validated end-entity certificate for the peer for clients that want to reason about the negotiated peer.

You could even have some sort of system-wide control that sets a default list of root CA certificates, which could be loaded from userspace by the superuser at runtime, which would permit someone using this mechanism to initialize the TLS layer without having to know its own preferred trust store.

So i'm agreeing with you -- it does seem like this is a plausible approach, above and beyond a PSK system. it would be inflexible, but that's not always a bad thing. Most applications really do just want a simple interface, and only the most sophisticated ones are willing to do the extra work to set up their own handshake.

That said, if i were trying to implement this, i'd start with a PSK handshake for both client and server. Then i'd add an anonymous (accepts any client, any configuration) server-side option, which needs sockopt mechanisms to provide secret key material and an X.509 cert chain. And only after that was all working would i consider how to do a minimal client-side, authenticated server mechanism.

Adding an in-kernel TLS handshake

Posted Jun 6, 2022 14:27 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

> - any sort of revocation check
> - denylists of known-invalid CAs
> - cached intermediate CAs (to be more likely to accept 'transvalid' end entity certificates)

These really seem to be something the keyctl subsystem could be used for, no? A service inserts the valid CAs on boot (with reasonable expiry times), known-bad CAs are either blocked there or more proactively in some other keyring. Intermediate CAs can be cached in either a system-wide, per-user, or per-process keyring as appropriate for the use case.

Adding an in-kernel TLS handshake

Posted Jun 4, 2022 20:07 UTC (Sat) by aaronmdjones (subscriber, #119973) [Link] (2 responses)

Forgive my ignorance, but what does TLS have to do with NVMe?

Adding an in-kernel TLS handshake

Posted Jun 4, 2022 20:28 UTC (Sat) by zev (subscriber, #88455) [Link] (1 responses)

Probably for NVMeoF, I'd guess?

Adding an in-kernel TLS handshake

Posted Jun 9, 2022 13:26 UTC (Thu) by wagi (subscriber, #57912) [Link]

yes, this is for NVMe over TCP

Adding an in-kernel TLS handshake

Posted Oct 10, 2022 2:08 UTC (Mon) by lucien.xin (guest, #160913) [Link]

I've written an implementation for this topic, though there are some security problems. FWIW, just in case someone wants to know more about this, see https://github.com/lxin/tls_hs/