This page describes how vpnkit
is used by Docker for Mac and Docker for Windows to
open ports on host interfaces and to forward traffic to containers.
Note: this is completely separate from using vpnkit as a default gateway. This implementation could be factored out into a separate executable.
On a regular installation of docker
on Linux the command:
docker run -p 8080:80 nginx
starts an nginx
container and forwards connections from 0.0.0.0:8080
on
the host to the container's port 80
.
The command:
docker run -p 1.2.3.4:8080:80 nginx
starts an nginx
container but only forwards connections from 1.2.3.4:8080
on
the host to the container's port 80
.
On Linux port forwarding can be achieved either through iptables
or by
running a simple user-space proxy. On Docker for Mac and Docker for Windows
- the
docker
daemon does not run on the host (except on Windows when running Windows containers-- this is out of scope of this document) and so cannot run anything on the host - the container network within the VM is separated from the host's network
by
vpnkit
(ignoring the internal network used on Windows for volume sharing): see using vpnkit as default gateway.
Therefore vpnkit
includes a port forwarding service which allows the docker
daemon
in the VM to open ports on the host and which forwards connections transparently
to the container port inside the VM.
The docker
daemon can either use iptables
or a userspace proxy to open ports
on a regular Linux system. In Docker for Mac and Docker for Windows we configure
the docker
daemon to use a userspace proxy, and we provide our own custom
implementation. This acts as a very basic "plugin".
For example, after running docker run -p 8080:80 nginx
on the host, inside the
VM we can see a process:
/usr/bin/slirp-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8080 -container-ip 172.17.0.2 -container-port 80
This shows a single port forward from 0.0.0.0:8080
on the host to the internal
IP 172.17.0.2:80
inside the VM.
This custom proxy uses a custom signaling protocol to communicate the port forwarding
request to the host, and to receive back success or error (e.g. EADDRINUSE
or EADDRNOTAVAIL
).
The control interface takes the form of a virtual 9P filesystem served by
vpnkit
and mounted in the Linux VM. New port forwards are requested by
creating directories, and status (including error messages) read by reading
files.
For example, after running docker run -p 8080:80 nginx
on the host, inside
the VM we can see:
/ # ls /port
README tcp:0.0.0.0:8080:tcp:172.17.0.2:80
This shows a single active port forward from 0.0.0.0:8080
on the host to the internal
IP 172.17.0.2:80
inside the VM.
The initial filesystem mount is slightly different between Docker for Mac and Docker for Windows.
The hyperkit hypervisor on the Mac has a
virtio-9p
device which connects to a Unix domain socket whenever the VM issues a mount
command.
On Windows we run a process 9p-mount
which calls listen
on a Hyper-V socket for
connections from the host. vpnkit
calls connect
, and then 9p-mount
calls
accept
and then passes
the file descriptor to the mount
command via the rfdno
and wfdno
arguments.
When it has opened a new host port forward, the custom proxy retains an open
file descriptor referencing a control file on the filesystem. If the proxy
is killed or crashes the Linux kernel will close the file descriptor and emit
a 9P clunk
message which is used by vpnkit
to shut down the port forward.
This ensures that the port forwards in vpnkit
do not leak.
Note: the use of a mounted filesystem is not ideal, for if the vpnkit
process
is restarted then the filesystem becomes broken. It would be better in future
to use a reconnectable protocol.
Note: the use of clunk
like this is quite fragile; if another process were
to open
and close
the file it would prematurly call clunk
and shutdown
the port forward.
When a client connects to the port on the host, vpnkit
accepts the connection.
On Windows, vpnkit
calls connect
on a Hyper-V socket to connect to the VM
on a well-known port.
On the Mac, vpnkit
calls connect
on a Unix domain socket to connect to the
hyperkit hypervisor's virtio-vsock
control
socket. vpnkit
writes a short header including the well-known AF_VSOCK
port number and is connected to the VM.
Inside
the VM there is a connection demultiplexer which calls listen
on this well-known port.
This process calls accept
and then reads a simple header which includes the
ultimate destination IP and port (172.17.0.2:80
in the example above).
The demultiplexer calls connect
to the container port and starts proxying data.
Note: since early versions of Windows 10 do not support shutdown
(i.e. the
ability to signal that write
calls have finished but while allowing read
calls to continue
c.f. TCP half-close) there is a simple protocol layered over the Hyper-V socket
stream which implements this behaviour, not described here.
Note: the code also transports UDP with a simple framing protocol, not described here.