Building Docker Images Without Root or Privilege Escalation

Granting minimal privileges is always a best practice and it is often a requirement. Running as a user other than root, and then not allowing privilege escalation to root, are common guardrails within the principle of least privilege. For many tasks, those guardrails work well, but for building docker images, they are problematic. Building a docker image from a Dockerfile with these constraints requires a creative approach involving virtualization.

Table of Contents

Principle of Least Privileges: Why Root Should Be Disallowed
How Root and Privilege Escalation Are Disallowed
Building Docker Images with Buildpacks Works
Building Docker Images from Dockerfiles Does Not Work
Getting Creative: Running the Build in a Virtual Machine
Building the Docker Image
Using the Docker Image to Build a Dockerfile without Root or Privilege Escalation
Improving Performance with KVM

Principle of Least Privileges: Why Root Should Be Disallowed

The principle of least privilege is a security concept that states a user or process should only be granted the minimum level of access necessary to perform their required tasks, meaning they should only have access to the specific data and resources needed to complete their job, and nothing more.

Implementations of the principle of least privilege involve network firewall rules, file system permissions, network share restrictions, CPU time limits, memory limits, and more. Many of these guardrails present challenges in terms of compliance, but they’re usually relatively easily surmounted. For example, ensuring appropriate file system permissions are in place, credentials are used appropriately, and network proxies are used as required tends to allow tasks to fit within the guardrails.

One of the more challenging guardrails to meet, and also most rewarding in terms of security posture improvements, is to disallow running as root. The root user has a lot of power, such as accessing hardware, altering fundamental system configurations, making any network request, and bypassing local filesystem permissions.

Because root has such immense power, running a task as a limited user is greatly preferred. Security frameworks such as the CIS Benchmarks have rules prohibiting the use of root. Furthermore, running a task as a limited user but allowing the task to escalate privileges to root (using a tool such as sudo, su, doas, etc.) is also not ideal. There are exceptions (some tasks legitimately must run as root), but as a general rule, it’s best to deny superuser privileges.

How Root and Privilege Escalation Are Disallowed

The specific way to implement root and privilege escalation disallowance varies by tool. The Linux kernel provides the “No New Privileges” flag upon which all tools are built.
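The flag can be observed from a shell. The following is a minimal sketch, assuming a Linux kernel of 4.10 or later (which exposes a NoNewPrivs field in /proc/<pid>/status); the no_new_privs function name is made up for illustration.

```shell
#!/bin/sh
set -eu

# Print the NoNewPrivs value (0 or 1) recorded in a /proc/<pid>/status file,
# or "unknown" when the file or the field is missing.
# With the default path, "self" resolves to the awk child process, which
# inherits the flag from this shell.
no_new_privs() {
    status_file="${1:-/proc/self/status}"
    if [ -r "$status_file" ]; then
        value="$(awk '$1 == "NoNewPrivs:" { print $2 }' "$status_file")"
    else
        value=""
    fi
    echo "${value:-unknown}"
}

echo "NoNewPrivs for this shell: $(no_new_privs)"
```

Run normally, this typically reports 0. Run under setpriv --no-new-privs (from util-linux), or inside a container with privilege escalation disallowed, it should report 1, because the flag is inherited by every child process and cannot be unset.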

For systemd units, a User other than root should be specified and NoNewPrivileges=true set.

For Kubernetes, set runAsNonRoot: true and allowPrivilegeEscalation: false on the security context. When runAsNonRoot: true is set, the USER in the docker image or the security context runAsUser must be set to a numeric value (when a username is specified, it may be mapped to UID 0, which would make that user root). To ensure compliance at a cluster level, Kyverno policies can be used, such as Require runAsNonRoot and Disallow Privilege Escalation. The Restricted Pod Security Standard also enforces these rules.
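As an illustration of those Kubernetes settings, here is a sketch of a shell function that prints a Pod manifest with both guardrails applied. The function name, pod name, image, and UID are hypothetical placeholders; only the securityContext fields come from the text above.

```shell
#!/bin/sh
set -eu

# Print a Pod manifest that runs as a numeric non-root UID and
# forbids privilege escalation. All arguments are placeholders.
restricted_pod_manifest() {
    name="$1"
    image="$2"
    uid="$3"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: ${uid}
  containers:
    - name: ${name}
      image: ${image}
      securityContext:
        allowPrivilegeEscalation: false
EOF
}

restricted_pod_manifest example-task registry.example.com/example/task:1.0 1005
```

The output can be piped to kubectl apply -f - to create the pod. Setting runAsUser explicitly avoids depending on whether the image declares a numeric USER.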

Building Docker Images with Buildpacks Works

There are numerous ways to build docker images. Some work fine under these restrictions, and others require creative solutions.

Buildpacks are a great way to build docker images and they don’t require root at all, so they work fine under these restrictions. The pack CLI can build an image without root privileges, as can Spring Boot’s bootBuildImage.

Building Docker Images from Dockerfiles Does Not Work

Building an image from a Dockerfile is trickier.

Kaniko can build a docker image from a Dockerfile without root and privilege escalation, but Kaniko is unmaintained and has several unmitigated security issues.

Podman/Buildah (Podman bundles Buildah, using it to build images) and Docker/BuildKit are the de facto ways to build Dockerfiles, but both require running as root or running as non-root with privilege escalation permitted. Docker’s Rootless Mode explains the situation:

Rootless mode does not use binaries with SETUID bits or file capabilities, except newuidmap and newgidmap, which are needed to allow multiple UIDs/GIDs to be used in the user namespace.

Using SETUID or file capabilities for privilege escalation is not permitted when privilege escalation is disallowed. An attempt to use rootless buildkit with privilege escalation disallowed results in this error: error: [rootlesskit:parent] error: failed to setup UID/GID map: newuidmap 10 [0 1000 1 1 100000 65536] failed: newuidmap: Could not set caps: exit status 1. Rootless buildah similarly fails.
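The bracketed numbers in that error appear to be the arguments passed to newuidmap, which takes its mappings as triples of (first UID inside the namespace, first UID outside it, count). As a reading aid, this small sketch decodes such triples; describe_uid_map is a hypothetical helper and not part of any tool.

```shell
#!/bin/sh
set -eu

# Print one line per (inside, outside, count) triple, showing which UID
# range inside the user namespace maps to which range outside it.
describe_uid_map() {
    while [ "$#" -ge 3 ]; do
        inside="$1"
        outside="$2"
        count="$3"
        shift 3
        echo "namespace UIDs ${inside}-$((inside + count - 1)) -> parent UIDs ${outside}-$((outside + count - 1))"
    done
}

# The triples from the error message above.
describe_uid_map 0 1000 1 1 100000 65536
```

Read this way, the request maps UID 0 inside the namespace to the unprivileged UID 1000 outside, plus a block of 65536 subordinate UIDs. Installing a mapping of more than the caller's own UID is the step that needs newuidmap's elevated privileges, which is why it fails here.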

Building a Dockerfile intrinsically requires setting up user namespaces which requires privilege escalation… so there is no way to build a Dockerfile without privilege escalation. Or is there?

Getting Creative: Running the Build in a Virtual Machine

We’ve established that building a Dockerfile into a docker image is impossible without privilege escalation, so what if we could allow privilege escalation and continue to disallow it at the same time? Virtualization (or emulation) allows running an entire computer as a regular task with no special privileges. Isolation for security purposes is a major benefit of virtualization, so using it in this way aligns with the purpose of the tool.

The Kubernetes pod, for example, continues to have all restrictions applied to it, the pod runs QEMU, and the virtual machine runs Linux which builds the docker image. All of the network/filesystem/other restrictions that apply to the pod continue to apply within the virtual machine. The virtual machine is just another program, running as the user of the pod just like anything else, so it cannot do anything that the pod cannot do. Once the virtual machine builds the docker image, it can be exported as a file or pushed to a docker registry.

To implement this approach, a docker image containing QEMU and a VM image must be built, and the VM image must contain buildkit (or buildah, as preferred) and the tools necessary to run it.

Building the Docker Image

This implementation builds on qemu-docker, which already does the tedious work of providing, configuring, and maintaining QEMU within a docker image. It also relies on Alpine to produce a small, maintainable Linux kernel and initramfs for the VM.

QEMU is configured to use a microvm (inspired by Amazon’s Firecracker, which powers Lambda) and virtio devices for minimal startup time and maximum performance. 9p is used for sharing files (such as the Dockerfile to be built and the files it references) between the VM and host.

Note that best practices are followed for these scripts. Linters, which are key to secure, maintainable, quality DevSecOps, are run, including shellcheck and Docker Build Checks. Docker image digests are used.

Save these files in a new directory.

init.sh: Runs as init inside the QEMU virtual machine.

#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail

export "PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin"
mount cgroup /sys/fs/cgroup -t cgroup
mount -o remount,rw,noatime /

mkdir -p /app
mount -t 9p -o trans=virtio,nodev,msize=5000000,cache=loose app /app

mkdir -p /variables
mount -t 9p -o trans=virtio,nodev,msize=5000000,cache=loose variables /variables

on_exit()
{
    status=$?
    echo "$status" > /variables/exit_code
    poweroff -f
}
trap on_exit EXIT

echo /dev /dev/shm /dev/pts /proc /sys /sys/fs/cgroup /run | xargs -n1 | xargs -I{} sh -c 'mkdir -p /buildkit{} && mount -o bind,ro,nosuid,nodev,noexec {} /buildkit{}'

# When https://gitlab.alpinelinux.org/alpine/mkinitfs/-/merge_requests/184 is merged and available,
# Modify entrypoint.sh (see comments in that file)
# and remove this block
ifconfig eth0 10.0.2.15 netmask 255.255.255.0
ip route add 0.0.0.0/0 via 10.0.2.2 dev eth0
echo "nameserver 10.0.2.3" > /etc/resolv.conf

/build-image.sh

entrypoint.sh: Runs in the container, validates the environment variables, and starts the QEMU virtual machine.

#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail

if [ -z "${CI_PROJECT_DIR:-}" ]; then
    >&2 echo "CI_PROJECT_DIR must be set to the directory containing the files necessary to build the docker image."
    exit 1
fi

if [ ! -d "${CI_PROJECT_DIR}" ]; then
    >&2 echo "CI_PROJECT_DIR is not set correctly. The directory ${CI_PROJECT_DIR} does not exist."
    exit 1
fi

if [ -z "${DOCKER_AUTH_CONFIG:-}" ]; then
    >&2 echo "DOCKER_AUTH_CONFIG must be set. See https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#determine-your-docker_auth_config-data"
    exit 1
fi

if [ -z "${CI_JOB_URL:-}" ]; then
    >&2 echo "CI_JOB_URL must be set."
    exit 1
fi

if [ -z "${CI_PROJECT_URL:-}" ]; then
    >&2 echo "CI_PROJECT_URL must be set."
    exit 1
fi

if [ -z "${DOCKER_FILE:-}" ]; then
    >&2 echo "DOCKER_FILE must be an absolute path or path relative to ${CI_PROJECT_DIR} for the Dockerfile."
    exit 1
fi

DOCKER_FILE="$(realpath -s --relative-to="${CI_PROJECT_DIR}" "${DOCKER_FILE}")"

if [ ! -f "${DOCKER_FILE}" ]; then
    >&2 echo "DOCKER_FILE is not set correctly. The file ${DOCKER_FILE} does not exist."
    exit 1
fi

if [ -z "${DOCKER_IMAGE:-}" ]; then
    >&2 echo "DOCKER_IMAGE must be set to the fully qualified name (including registry, name, and tag) where the image will be pushed."
    exit 1
fi

if [ -z "${BUILD_CONTEXT:-}" ]; then
    >&2 echo "BUILD_CONTEXT must be an absolute path or path relative to ${CI_PROJECT_DIR} of the directory to use as the build context."
    exit 1
fi

BUILD_CONTEXT="$(realpath -s --relative-to="${CI_PROJECT_DIR}" "${BUILD_CONTEXT}")"

if [ ! -d "${BUILD_CONTEXT}" ]; then
    >&2 echo "BUILD_CONTEXT is not set correctly. The directory ${BUILD_CONTEXT} does not exist."
    exit 1
fi

variables_directory="$(mktemp -d -t vars.XXX)"

on_exit()
{
    if [ -f "${variables_directory}/exit_code" ]; then
        status="$(cat "${variables_directory}/exit_code")"
    else
        >&2 echo "Failed to run buildkit"
        status=1
    fi
    rm -rf "${variables_directory}"
    exit "$status"
}
trap on_exit EXIT

echo "${DOCKER_AUTH_CONFIG}" > "${variables_directory}/DOCKER_AUTH_CONFIG"
echo "${CI_JOB_URL}" > "${variables_directory}/CI_JOB_URL"
echo "${CI_PROJECT_URL}" > "${variables_directory}/CI_PROJECT_URL"
echo "${DOCKER_FILE}" > "${variables_directory}/DOCKER_FILE"
echo "${DOCKER_IMAGE}" > "${variables_directory}/DOCKER_IMAGE"
echo "${BUILD_CONTEXT}" > "${variables_directory}/BUILD_CONTEXT"

if [ -c /dev/kvm ]; then
    echo "Using KVM acceleration."
    kvm="yes"
    cpu="host"
else
    echo "Not using KVM acceleration. Performance will drastically suffer."
    echo "To use KVM acceleration, the /dev/kvm device must exist and be readable and writable by the current user."
    kvm=""
    # Software is starting to target x86-64-v2 as a baseline, not running if
    # that baseline is unmet.
    # x86-64-v3 is safe as almost all 2015 or later CPUs meet this baseline.
    # x86-64-v4 is not safe as it requires AVX512 which is not available in
    # 2022 and later Intel CPUs.
    cpu="Skylake-Client-v4"
fi

# When https://gitlab.alpinelinux.org/alpine/mkinitfs/-/merge_requests/184 is merged and available,
# Add 'ip=10.0.2.15::10.0.2.2:255.255.255.0:guest:eth0::10.0.2.3:10.0.2.3' to the -append argument
# and remove the block in the init file

qemu-system-x86_64 \
    ${kvm:+--enable-kvm} \
    --cpu "${cpu}" \
    -M microvm \
    -nographic \
    -no-user-config \
    -no-reboot \
    -m 1500 \
    -smp 2 \
    -kernel /var/lib/qemu-docker/vmlinuz-virt \
    -initrd /var/lib/qemu-docker/initramfs-virt \
    -append 'quiet earlyprintk=ttyS0 console=ttyS0 reboot=t panic=-1 root=/dev/vda rootfstype=ext4 modules=9p mitigations=off' \
    -drive id=buildkit,file=/var/lib/qemu-docker/buildkit.qcow2,format=qcow2,if=none,cache=unsafe \
    -device virtio-blk-device,drive=buildkit \
    -device virtio-rng-device \
    -fsdev "local,path=${CI_PROJECT_DIR},security_model=none,id=app,readonly=on" \
    -device virtio-9p-device,fsdev=app,mount_tag=app \
    -fsdev "local,path=${variables_directory},security_model=none,id=variables" \
    -device virtio-9p-device,fsdev=variables,mount_tag=variables \
    -netdev user,id=unet \
    -device virtio-net-device,netdev=unet \
    -net user \
    || :
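The two scripts above pass the build result out of the VM through the shared variables directory: init.sh writes its exit status to exit_code from an EXIT trap, and entrypoint.sh reads it back, treating a missing file as a failure. That handoff can be tried without QEMU. The following sketch uses a subshell as a stand-in for the VM; fake_guest and host_status are hypothetical names.

```shell
#!/bin/sh
set -eu

# Stand-in for the VM: records its exit status in the shared directory
# from an EXIT trap, the way init.sh does before powering off.
fake_guest() {
    shared_dir="$1"
    build_status="$2"
    (
        trap 'echo "$?" > "${shared_dir}/exit_code"' EXIT
        exit "$build_status"
    ) || :
}

# Host side: recover the status from the shared directory, treating a
# missing file as a failure like entrypoint.sh does.
host_status() {
    shared_dir="$1"
    if [ -f "${shared_dir}/exit_code" ]; then
        cat "${shared_dir}/exit_code"
    else
        echo 1
    fi
}

shared="$(mktemp -d)"
fake_guest "$shared" 3
echo "guest reported: $(host_status "$shared")"
rm -rf "$shared"
```

This indirection is needed because QEMU's own exit status does not reflect what happened inside the guest, which is also why the qemu-system-x86_64 invocation ends with || : and leaves the final status to the EXIT trap.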

Dockerfile:

# syntax=docker/dockerfile:1.13.0@sha256:426b85b823c113372f766a963f68cfd9cd4878e1bcc0fda58779127ee98a28eb
# check=error=true

FROM alpine:3.21.2@sha256:56fa17d2a7e7f168a043a2712e63aed1f8543aeafdcee47c58dcffe38ed51099 AS builder

# Size of the disk used in the VM.
# The docker image is built on this disk so the disk size limits the size of the image that can be built.
# If building docker images using this docker image fails due to running out of disk space, increase this value.
# Note that since qcow2 (which is sparse) is used as the disk image format, changing the disk size doesn't (significantly) change the size of this docker image.
ARG DISK_SIZE=100G

ARG BUILDKIT_IMAGE=moby/buildkit:v0.19.0@sha256:14aa1b4dd92ea0a4cd03a54d0c6079046ea98cd0c0ae6176bdd7036ba370cbbe

SHELL ["/bin/sh", "-euo", "pipefail", "-c"]

WORKDIR /build

COPY build-image.sh init.sh /build/

RUN <<EOF

apk add --no-cache crane curl e2fsprogs e2tools qemu-img linux-virt

# Build initramfs
echo 'features="9p base ext4 kms network virtio"' > /etc/mkinitfs/mkinitfs.conf
kernel="$(basename $(find "/lib/modules/"* -maxdepth 0))"
/sbin/mkinitfs "$kernel"

# get the buildkit docker image and convert it to a disk image
mkdir /buildkit
crane export --platform "linux/amd64" "${BUILDKIT_IMAGE}" - | tar -C /buildkit -xf -
cp build-image.sh /buildkit
cp init.sh /buildkit/sbin/init
mke2fs -t ext4 -d /buildkit buildkit.raw "$DISK_SIZE"
qemu-img convert -f raw buildkit.raw -O qcow2 buildkit.qcow2
rm buildkit.raw

EOF

FROM qemux/qemu-docker:6.13@sha256:33c41464ef2bc8207d23d254d0c00c9e8da9b50800fc6e84554191b10402088c

SHELL ["/bin/sh", "-euo", "pipefail", "-c"]

WORKDIR /var/lib/qemu-docker/
COPY --from=builder --chmod=0666 /boot/vmlinuz-virt /boot/initramfs-virt /build/buildkit.qcow2 /var/lib/qemu-docker/
COPY --chmod=0555 entrypoint.sh /var/lib/qemu-docker/

RUN useradd --uid 1005 --create-home --shell /usr/sbin/nologin qemu

USER 1005

ENTRYPOINT ["/usr/bin/tini", "-s", "/var/lib/qemu-docker/entrypoint.sh"]

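build-image.sh: Runs inside the QEMU virtual machine (called by init.sh) and uses buildkit to build and push the image.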

#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail

CI_JOB_URL="$(cat /variables/CI_JOB_URL)"
CI_PROJECT_URL="$(cat /variables/CI_PROJECT_URL)"
DOCKER_FILE="$(cat /variables/DOCKER_FILE)"
DOCKER_IMAGE="$(cat /variables/DOCKER_IMAGE)"
BUILD_CONTEXT="$(cat /variables/BUILD_CONTEXT)"
mkdir -p "${HOME}/.docker"
cp "/variables/DOCKER_AUTH_CONFIG" "${HOME}/.docker/config.json"

cd /app

SOURCE_DATE_EPOCH=1 buildctl-daemonless.sh build \
          --progress=plain \
          --output rewrite-timestamp=true,type=image,oci-mediatypes=false,push=true,name="$DOCKER_IMAGE" \
          --frontend dockerfile.v0 \
          --local context="${BUILD_CONTEXT}" \
          --local dockerfile=. \
          --opt filename="${DOCKER_FILE}" \
          --opt attest:provenance=mode=max,builder-id="${CI_JOB_URL}" \
          --opt label:org.opencontainers.image.source="${CI_PROJECT_URL}"

Build the docker image:

docker build -t qemu-buildkit .

The image can be built by itself (dogfooding).

You can push the image to a docker registry for use with a Kubernetes cluster, GitLab job, or anything else.

Using the Docker Image to Build a Dockerfile without Root or Privilege Escalation

The quickest and easiest way to test this out is to cd to a directory containing a Dockerfile then run:

docker run -it -e CI_JOB_URL=http://ci.example.com/job1234 -e CI_PROJECT_URL=http://myproject.example.com -e BUILD_CONTEXT=. -e DOCKER_FILE=Dockerfile -e DOCKER_IMAGE=docker.io/example/myimage -e DOCKER_AUTH_CONFIG='{"auths":{"docker.io":{"auth":"base64ofusernamecolonpassword"}}}' -e CI_PROJECT_DIR=/app -w /app -v "$(pwd):/app:Z" qemu-buildkit

There are a number of required environment variables:

CI_JOB_URL: Used as builder id in SLSA provenance attached to the image built
CI_PROJECT_URL: Set as the org.opencontainers.image.source label on the image built
CI_PROJECT_DIR: Path within the container of the directory containing the files necessary to build the docker image (including the Dockerfile).
BUILD_CONTEXT: Directory to set as the build context when building the docker image. Usually set to . (the current directory). This is an absolute path or a path resolved relative to CI_PROJECT_DIR.
DOCKER_FILE: Path of the Dockerfile to build. Usually set to Dockerfile. This is an absolute path or a path resolved relative to CI_PROJECT_DIR.
DOCKER_IMAGE: Fully qualified name (including registry, name, and tag) where the image will be pushed.
DOCKER_AUTH_CONFIG: JSON containing the credentials used to access docker registries. Must contain credentials used to push the image to the name provided in DOCKER_IMAGE. GitLab provides documentation for how to set up DOCKER_AUTH_CONFIG.
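For DOCKER_AUTH_CONFIG, the auth value is the base64 encoding of username:password. A minimal sketch of producing the JSON for a single registry follows. The docker_auth_config function name is hypothetical, and the sketch assumes the registry name needs no JSON escaping.

```shell
#!/bin/sh
set -eu

# Build a DOCKER_AUTH_CONFIG value for a single registry.
docker_auth_config() {
    registry="$1"
    username="$2"
    password="$3"
    # "auth" is base64 of "username:password"; tr strips the line breaks
    # that base64 inserts into long output.
    auth="$(printf '%s:%s' "$username" "$password" | base64 | tr -d '\n')"
    printf '{"auths":{"%s":{"auth":"%s"}}}\n' "$registry" "$auth"
}

# Placeholder credentials; use a real username and token in practice.
docker_auth_config docker.io user pass
```

The result can be passed directly, for example -e DOCKER_AUTH_CONFIG="$(docker_auth_config docker.io "$user" "$token")", where user and token are your own variables. In CI, prefer a masked variable over building the value in a script.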

Modify the scripts as necessary to best suit your specific needs and requirements.

The image is now ready to be used in Kubernetes, GitLab jobs, GitHub Actions, Jenkins, and anywhere else.

Improving Performance with KVM

You may notice the message “Not using KVM acceleration. Performance will drastically suffer.” in the job output. When building simple, small docker images, the slowdown isn’t terrible, adding just a few seconds compared to a normal build. For larger images or images that involve more work to build, though, the slowdown is drastic, on the order of 10x or more time required. Fortunately, this performance issue can (usually) be addressed.

Even if the slowdown is significant, this approach at least allows for building images, a task that is otherwise impossible under these conditions.

KVM provides hardware-assisted virtualization, using the CPU’s virtualization instructions along with other features. Without KVM, QEMU falls back to full emulation. Emulation means every CPU instruction is run in software, whereas with virtualization most CPU instructions run natively and only special or unsafe ones are intercepted and handled by KVM and QEMU. Virtualization has far less overhead than emulation, with virtualized workloads running at essentially native speed.
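Whether the CPU advertises those virtualization instructions can be checked from /proc/cpuinfo, where Intel VT-x appears as the vmx flag and AMD-V as svm. This is a rough sketch with a hypothetical function name. A flag being present only says the hardware is capable, not that /dev/kvm is available to the current process.

```shell
#!/bin/sh
set -eu

# Print "vmx" (Intel VT-x) or "svm" (AMD-V) if the CPU flags list one of
# them, otherwise "none".
cpu_virtualization_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    flag="$(grep -E -o -w 'vmx|svm' "$cpuinfo" 2>/dev/null | head -n 1 || :)"
    echo "${flag:-none}"
}

echo "CPU virtualization flag: $(cpu_virtualization_flag)"
```

Inside a VM or cloud instance the flag may be hidden by the hypervisor, in which case nested virtualization would have to be enabled by the provider before KVM can be used.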

Virtualization, specifically Linux KVM, is the foundation of much of the cloud; for example, the AWS Nitro hypervisor is built on KVM. KVM is widely used, secure, and well-supported.

On Kubernetes, Generic Device Plugin is a common way to expose devices, such as KVM, to pods. Only /dev/kvm needs to be exposed.

For GitLab Kubernetes executors, the GitLab runner configuration has to be altered.

To expose KVM to the container using docker run, pass --device /dev/kvm. For example, the test command previously provided altered to use KVM is:

docker run -it -e CI_JOB_URL=http://ci.example.com/job1234 -e CI_PROJECT_URL=http://myproject.example.com -e BUILD_CONTEXT=. -e DOCKER_FILE=Dockerfile -e DOCKER_IMAGE=docker.io/example/myimage -e DOCKER_AUTH_CONFIG='{"auths":{"docker.io":{"auth":"base64ofusernamecolonpassword"}}}' -e CI_PROJECT_DIR=/app -w /app -v "$(pwd):/app:Z" --device /dev/kvm qemu-buildkit
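If the “Not using KVM acceleration” message still appears, a quick check inside the container can help narrow down why. entrypoint.sh only tests that /dev/kvm is a character device; the sketch below is slightly stricter and also tests that the current user can read and write it. kvm_usable is a hypothetical helper name.

```shell
#!/bin/sh
set -eu

# Succeed only when the path is a character device that the current user
# can both read and write, which is what QEMU needs from /dev/kvm.
kvm_usable() {
    device="${1:-/dev/kvm}"
    [ -c "$device" ] && [ -r "$device" ] && [ -w "$device" ]
}

if kvm_usable; then
    echo "KVM is usable"
else
    echo "KVM is not usable; QEMU would fall back to emulation"
fi
```

A common cause of failure is that the device exists but belongs to a group (often kvm) that the container user is not a member of.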

Now enjoy performant, secure Dockerfile builds.

Building Docker Images Without Root or Privilege Escalation by Craig Andrews is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.