Read-only filesystems in Docker and Kubernetes

You should harden your containerized workloads to minimize the attack surface of your overall application. Paying close attention while building your container images and applying proven patterns will reduce the risk of an attack while running your application in production. One of the simple practices to apply is setting the filesystem of your containers to read-only, and that’s exactly what we will cover in this article.

Read-only filesystems in Docker

Docker and compatible CLIs offer the --read-only flag, which mounts the container’s root filesystem as read-only. Let’s try it! We start a new container from the official ubuntu image with an interactive TTY to verify that the filesystem is read-only. The following docker run command uses -it (allocates a pseudo-TTY and keeps STDIN open) and --rm (removes the container automatically once it exits):

docker run -it --rm --read-only ubuntu

root@9a3d486f6eda:/# mkdir app
mkdir: cannot create directory 'app': Read-only file system
root@9a3d486f6eda:/#
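
The same setting can also be confirmed from the host while the container is still running. The following is a minimal sketch: the helper name is_read_only is our own and not part of Docker. It reads the HostConfig.ReadonlyRootfs field reported by docker inspect:

```shell
# Hypothetical helper (not part of Docker): succeeds if the given
# container was started with --read-only.
is_read_only() {
  # HostConfig.ReadonlyRootfs is "true" for containers started with --read-only
  [ "$(docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' "$1")" = "true" ]
}

# Usage sketch, run from a second terminal on the host:
#   is_read_only 9a3d486f6eda && echo "root filesystem is read-only"
```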

So far, so good. We got a new Ubuntu container and can try to modify the filesystem, e.g., by creating new directories or changing existing content. No matter which operation you try, the operating system will prevent the change and print the error shown in the snippet above. This will work for some containerized workloads. However, chances are good that a containerized application has to write to the filesystem. Think of lock files or PID files. A good example is NGINX, the popular webserver. Let’s try to start NGINX with the filesystem set to read-only:

# start the container with read-only fs
docker run -d -p 8080:80 --read-only nginx:alpine
58620670b568c4675d450fe947191b12c7d174e790f8e3617b171fee22767b92

# grab logs from the container
docker logs 5862
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/10/19 06:39:51 [emerg] 1#1: mkdir() "/var/cache/nginx/client_temp" failed (30: Read-only file system)
nginx: [emerg] mkdir() "/var/cache/nginx/client_temp" failed (30: Read-only file system)

As you can see, several log lines complain about the filesystem being read-only, and NGINX exits because it cannot create /var/cache/nginx/client_temp. We must allow NGINX to write to the necessary files and directories to get it working again. This is where a temporary filesystem (tmpfs) enters the stage.

Temporary filesystem (tmpfs) in Docker

In Docker, a temporary filesystem (tmpfs) is mounted at a particular location in the container, much like a regular volume. However, a temporary filesystem is not persistent. It is backed by the host’s memory rather than by storage outside the container. Consequently, everything written to the temporary filesystem is gone once the container stops. On top of that, multiple containers can’t share a tmpfs mount.
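
To see the effect in isolation, the sketch below starts a throwaway ubuntu container with a read-only root filesystem and a single tmpfs mount at /tmp, then attempts one write inside the mount and one outside of it. The function name demo_tmpfs and the 16m size cap are illustrative assumptions. The mount options after the colon (rw, noexec, nosuid, size) follow the comma-separated syntax that --tmpfs accepts:

```shell
# Hypothetical demo wrapper: the name and the 16m cap are assumptions.
demo_tmpfs() {
  # Root filesystem is read-only; only /tmp is writable and lives in memory
  docker run --rm --read-only \
    --tmpfs /tmp:rw,noexec,nosuid,size=16m \
    ubuntu \
    sh -c 'touch /tmp/ok && echo "write to /tmp succeeded"; touch /ok || echo "write to / failed"'
}

# Usage sketch:
#   demo_tmpfs
```

Capping the size is worth considering because a tmpfs mount consumes host memory.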

But let’s revisit the logs created by nginx:alpine from the previous section. We saw that the container’s entrypoint script tried to modify the NGINX configuration file located at /etc/nginx/conf.d/default.conf. We will ignore this for now because we don’t want our webserver to apply runtime modifications to the default configuration file. Additionally, NGINX tried to create the client_temp folder at /var/cache/nginx/client_temp, which also failed because the filesystem was read-only. Let’s allow modifications to /var/cache/nginx/ by adding --tmpfs to our docker run command:

# start container with read-only fs and tmpfs
docker run -d -p 8080:80 --read-only --tmpfs /var/cache/nginx/ nginx:alpine
287af36ed9757dc8cdfcdedfad74504b8f16dc779fc33c1b8fa630128af5673c

# grab logs from the container
docker logs 287

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/10/19 06:55:46 [notice] 1#1: using the "epoll" event method
2021/10/19 06:55:46 [notice] 1#1: nginx/1.21.3
2021/10/19 06:55:46 [notice] 1#1: built by gcc 10.3.1 20210424 (Alpine 10.3.1_git20210424)
2021/10/19 06:55:46 [notice] 1#1: OS: Linux 5.10.47-linuxkit
2021/10/19 06:55:46 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2021/10/19 06:55:46 [emerg] 1#1: open() "/var/run/nginx.pid" failed (30: Read-only file system)
nginx: [emerg] open() "/var/run/nginx.pid" failed (30: Read-only file system)

Bummer! It fails again because NGINX wants to create a PID file. Let’s remove the container using docker rm -f 287 and add another temporary filesystem at /var/run:

# run container with read-only fs and two tmpfs mounts
docker run -d -p 8080:80 --read-only --tmpfs /var/cache/nginx/ --tmpfs /var/run/ nginx:alpine
c4f7f8f5a6ef1cf7f89093c2b0ef3c70cc05d7ce08fcfaa17c56f62adf50026f

# grab logs from the container
docker logs c4f

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/10/19 09:10:30 [notice] 1#1: using the "epoll" event method
2021/10/19 09:10:30 [notice] 1#1: nginx/1.21.3
2021/10/19 09:10:30 [notice] 1#1: built by gcc 10.3.1 20210424 (Alpine 10.3.1_git20210424)
2021/10/19 09:10:30 [notice] 1#1: OS: Linux 5.10.47-linuxkit
2021/10/19 09:10:30 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2021/10/19 09:10:30 [notice] 1#1: start worker processes
2021/10/19 09:10:30 [notice] 1#1: start worker process 24
2021/10/19 09:10:30 [notice] 1#1: start worker process 25
2021/10/19 09:10:30 [notice] 1#1: start worker process 26
2021/10/19 09:10:30 [notice] 1#1: start worker process 27

This looks better now. NGINX started successfully, although we still get the message that default.conf can’t be modified. We can quickly test our webserver by opening http://localhost:8080, the forwarded port on the host system, in any web browser.
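
Instead of discovering writable paths one crash at a time, one option is to run the image once without --read-only and ask docker diff what changed. docker diff prints one line per changed path, prefixed with A (added), C (changed), or D (deleted). The helper below, whose name added_parent_dirs is our own, reduces that output to the parent directories of added paths. Treat the result as a list of candidates for tmpfs mounts that still needs review, not as a definitive answer:

```shell
# Hypothetical helper (the name is our own): reads `docker diff` output on
# stdin and prints the parent directory of every added (A) path once.
added_parent_dirs() {
  while read -r kind path; do
    # Lines look like "A /some/path"; C (changed) and D (deleted) are skipped
    if [ "$kind" = "A" ]; then
      dirname "$path"
    fi
  done | sort -u
}

# Usage sketch (needs a container started WITHOUT --read-only):
#   docker diff <container> | added_parent_dirs
```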

NGINX with read-only filesystem - Welcome Page

Read-only filesystems in Kubernetes

Chances are quite good that you intend to run containerized workloads in Kubernetes. In Kubernetes, you can instruct the kubelet to run containers with a read-only root filesystem by setting spec.containers[].securityContext.readOnlyRootFilesystem to true. For demonstration purposes, we will again take an NGINX webserver and run it directly in Kubernetes using a regular Pod as shown here:

apiVersion: v1
kind: Pod
metadata:
  name: webserver
  labels:
    name: webserver
spec:
  containers:
  - name: webserver
    image: nginx:alpine
    securityContext:
      readOnlyRootFilesystem: true
    ports:
      - containerPort: 80

With the filesystem set to read-only, we again have to provide writable locations, as we did with tmpfs in Docker. In Kubernetes, we use ephemeral volumes to achieve this.

Ephemeral Volumes (aka tmpfs) for Kubernetes

Although Kubernetes offers several types of ephemeral volumes as described in its official documentation, we will use the simplest kind for this scenario, which is emptyDir. When we use emptyDir as a volume, Kubernetes provides an initially empty directory on the underlying worker node, which lives as long as the Pod.

Optionally, we can instruct Kubernetes to use the memory (RAM) of the worker node as the source for the volume. This is controlled by the medium property of emptyDir: setting it to Memory backs the volume with tmpfs. From an architectural perspective, ephemeral volumes look as shown in the following figure:

Let’s extend our Kubernetes manifest and provide two independent volumes. One for /var/run and the second one for /var/cache/nginx - as we’ve done previously with tmpfs in plain Docker:

# pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: webserver
  labels:
    name: webserver
spec:
  containers:
    - name: webserver
      image: nginx:alpine
      securityContext:
        readOnlyRootFilesystem: true
      ports:
        - containerPort: 80
      volumeMounts:
        - mountPath: /var/run
          name: tmpfs-1
        - mountPath: /var/cache/nginx
          name: tmpfs-2
  volumes:
    - name: tmpfs-1
      emptyDir: {}
#   - name: tmpfs-ram
#     emptyDir:
#       medium: "Memory"
    - name: tmpfs-2
      emptyDir: {}
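
The commented-out lines in the manifest hint at the memory-backed variant. As a sketch of how that could look, the function below applies a second Pod from stdin whose /var/run volume uses medium: Memory. The function name, the Pod name webserver-ram, the volume names, and the 16Mi sizeLimit are all assumptions made for illustration:

```shell
# Hypothetical helper: applies a variant of the Pod whose /var/run volume is
# RAM-backed. Pod name, volume names, and the 16Mi limit are assumptions.
deploy_memory_backed() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: webserver-ram
spec:
  containers:
    - name: webserver
      image: nginx:alpine
      securityContext:
        readOnlyRootFilesystem: true
      ports:
        - containerPort: 80
      volumeMounts:
        - mountPath: /var/run
          name: run-ram
        - mountPath: /var/cache/nginx
          name: cache
  volumes:
    - name: run-ram
      emptyDir:
        medium: Memory   # back this volume with tmpfs
        sizeLimit: 16Mi  # optional cap on the volume size
    - name: cache
      emptyDir: {}
EOF
}

# Usage sketch:
#   deploy_memory_backed
```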

Let’s quickly deploy this manifest to Kubernetes and verify that the webserver can be accessed as expected:

# Deploy to Kubernetes
kubectl apply -f pod.yml
pod/webserver created

# Grab Logs from the webserver pod
kubectl logs webserver

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/10/19 10:30:49 [notice] 1#1: using the "epoll" event method
2021/10/19 10:30:49 [notice] 1#1: nginx/1.21.3
2021/10/19 10:30:49 [notice] 1#1: built by gcc 10.3.1 20210424 (Alpine 10.3.1_git20210424)
2021/10/19 10:30:49 [notice] 1#1: OS: Linux 5.4.0-1059-azure
2021/10/19 10:30:49 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2021/10/19 10:30:49 [notice] 1#1: start worker processes
2021/10/19 10:30:49 [notice] 1#1: start worker process 22
2021/10/19 10:30:49 [notice] 1#1: start worker process 23

# Create a port-forwarding
kubectl port-forward webserver 8081:80

Forwarding from 127.0.0.1:8081 -> 80
Forwarding from [::1]:8081 -> 80

Again, let’s use the web browser and hit our NGINX webserver at http://localhost:8081. At this point, you should again see the NGINX welcome page.
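
As in the Docker section, it can be reassuring to check the write behaviour from inside the running container. The sketch below uses kubectl exec to attempt one write in the mounted /var/run volume and one in the root filesystem. The function name check_pod_writes and the probe file name are our own:

```shell
# Hypothetical helper: probes a writable volume and the read-only root
# filesystem inside the given Pod (defaults to the webserver Pod).
check_pod_writes() {
  pod="${1:-webserver}"
  # The probe file is removed again so the check leaves nothing behind
  kubectl exec "$pod" -- sh -c \
    'touch /var/run/probe && rm /var/run/probe && echo "/var/run is writable"; touch /probe 2>/dev/null || echo "/ is read-only"'
}

# Usage sketch:
#   check_pod_writes webserver
```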

Conclusion

Setting the container’s filesystem to read-only can quickly minimize the attack surface for containerized workloads. However, in a real-world scenario, chances are pretty good that applications may have to write to the filesystem at several locations. In this post, we walked through the process of allowing modifications for specific locations using tmpfs in Docker and ephemeral volumes in Kubernetes.

No matter which application you’re shipping in containers, you should always try to use read-only filesystems if possible and allow modifications only for known directories.

Thorsten Hans
