Granting minimal privileges is always a best practice and is often a requirement. Running as a user other than root, and then not allowing privilege escalation to root, are common guardrails within the principle of least privilege. For many tasks, those guardrails work well, but for building docker images, they are problematic. Building a docker image from a Dockerfile with these constraints requires a creative approach involving virtualization.
- Principle of Least Privileges: Why Root Should Be Disallowed
- How Root and Privilege Escalation Are Disallowed
- Building Docker Images with Buildpacks Works
- Building Docker Images from Dockerfiles Does Not Work
- Getting Creative: Running the Build in a Virtual Machine
- Building the Docker Image
- Using the Docker Image to Build a Dockerfile without Root or Privilege Escalation
- Improving Performance with KVM
Principle of Least Privileges: Why Root Should Be Disallowed
The principle of least privilege is a security concept that states a user or process should only be granted the minimum level of access necessary to perform their required tasks, meaning they should only have access to the specific data and resources needed to complete their job, and nothing more.
Implementations of the principle of least privilege involve network firewall rules, file system permissions, network share restrictions, CPU time limits, memory limits, and more. Many of these guardrails present challenges in terms of compliance, but they’re usually relatively easily surmounted. For example, ensuring appropriate file system permissions are in place, credentials are used appropriately, and network proxies are used as required tends to allow tasks to fit within the guardrails.
One of the more challenging guardrails to meet, and also most rewarding in terms of security posture improvements, is to disallow running as root. The root user has a lot of power, such as accessing hardware, altering fundamental system configurations, making any network request, and bypassing local filesystem permissions.
Because root is so powerful, running a task as a limited user is greatly preferred. Security frameworks such as the CIS Benchmarks have rules prohibiting the use of root. Furthermore, running a task as a limited user but allowing the task to escalate privileges to root (using a tool such as sudo, su, or doas) is also not ideal. There are exceptions (some tasks legitimately must run as root), but as a general rule, it’s best to deny superuser privileges.
How Root and Privilege Escalation Are Disallowed
The specific way to disallow root and privilege escalation varies by tool. The Linux kernel provides the no_new_privs (“No New Privileges”) flag, on which the tools’ privilege escalation restrictions are built.
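As a minimal sketch of how to observe this flag, the following script reads the NoNewPrivs field from /proc. It assumes a Linux kernel of version 4.10 or later, where that field is exposed; the function name no_new_privs is a hypothetical helper, not part of any tool mentioned here.

```shell
#!/bin/sh
# Print the no_new_privs flag (0 or 1) for a process.
# With no argument, /proc/self is read by the awk child process, which
# inherits the flag from this shell. Prints nothing if the field is missing.
no_new_privs() {
    awk '/^NoNewPrivs:/ { print $2 }' "/proc/${1:-self}/status" 2>/dev/null || :
}

flag="$(no_new_privs)"
case "$flag" in
    1) echo "no_new_privs is set: setuid bits and file capabilities cannot grant privileges" ;;
    0) echo "no_new_privs is not set: privilege escalation is still possible" ;;
    *) echo "could not read NoNewPrivs (non-Linux system or older kernel)" ;;
esac
```

Running this inside a container or service is a quick way to check whether the restriction is actually in effect.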
For systemd units, a User= other than root should be specified and NoNewPrivileges=yes set.
For Kubernetes, set runAsNonRoot: true and allowPrivilegeEscalation: false on the security context. When runAsNonRoot: true is set, the USER in the docker image or the security context runAsUser must be set to a numeric value (when a username is specified, it may be mapped to UID 0, which would make that user root). To ensure compliance at a cluster level, Kyverno policies can be used, such as Require runAsNonRoot and Disallow Privilege Escalation. The Restricted Pod Security Standard also enforces these rules.
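A pod spec carrying these settings might look like the following sketch, written as a shell function that prints the manifest. The function name, pod name, container name, and image reference are hypothetical placeholders; UID 1005 is chosen only to match the qemu user created later in this article. The capabilities and seccompProfile entries are included because the Restricted Pod Security Standard also expects them.

```shell
#!/bin/sh
# Print a Pod manifest that runs the given image as a non-root user with
# privilege escalation disallowed. All names here are placeholders.
restricted_pod_manifest() {
    image="$1"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1005
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: main
      image: ${image}
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
EOF
}

restricted_pod_manifest "registry.example.com/qemu-buildkit:latest"
```

The output can be piped to kubectl apply -f - on a cluster. A real build pod would also need the environment variables described later in this article.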
Building Docker Images with Buildpacks Works
There are numerous ways to build docker images. Some work fine under these restrictions, and others require creative solutions.
Buildpacks are a great way to build docker images, and they don’t require root at all, so they work fine under these restrictions. The pack CLI can build an image without root privileges, as can Spring Boot’s bootBuildImage.
Building Docker Images from Dockerfiles Does Not Work
Building an image from a Dockerfile is trickier.
Kaniko can build a docker image from a Dockerfile without root and privilege escalation, but Kaniko is unmaintained and has several unmitigated security issues.
Podman/Buildah (Podman bundles Buildah and uses it to build images) and Docker/BuildKit are the de facto ways to build Dockerfiles, but both require running as root or running as non-root with privilege escalation permitted. Docker’s Rootless Mode documentation explains the situation:
“Rootless mode does not use binaries with SETUID bits or file capabilities, except newuidmap and newgidmap, which are needed to allow multiple UIDs/GIDs to be used in the user namespace.”
Using SETUID or file capabilities for privilege escalation is not permitted when privilege escalation is disallowed. An attempt to use rootless buildkit with privilege escalation disallowed results in this error:
error: [rootlesskit:parent] error: failed to setup UID/GID map: newuidmap 10 [0 1000 1 1 100000 65536] failed: newuidmap: Could not set caps: exit status 1
Rootless buildah similarly fails.
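To see which mechanism newuidmap and newgidmap rely on for a given system, a check along the following lines can help. This is a sketch: privilege_source is a hypothetical helper, and the file capability check only works when the getcap utility (from libcap) is installed and on the PATH.

```shell
#!/bin/sh
# Describe how a binary could gain privileges: via the setuid bit, via file
# capabilities, or not at all.
privilege_source() {
    file="$1"
    if [ -u "$file" ]; then
        echo "setuid"
    elif command -v getcap >/dev/null 2>&1 && [ -n "$(getcap "$file" 2>/dev/null)" ]; then
        echo "file capabilities"
    else
        echo "none"
    fi
}

for tool in newuidmap newgidmap; do
    path="$(command -v "$tool" 2>/dev/null || :)"
    if [ -n "$path" ]; then
        echo "$tool ($path): $(privilege_source "$path")"
    else
        echo "$tool: not installed"
    fi
done
```

Either result other than "none" depends on a mechanism that the kernel ignores once no_new_privs is set, which is why the rootless builders fail.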
Building a Dockerfile intrinsically requires setting up user namespaces, which requires privilege escalation… so there is no way to build a Dockerfile without privilege escalation. Or is there?
Getting Creative: Running the Build in a Virtual Machine
We’ve established that building a Dockerfile into a docker image is impossible without privilege escalation, so what if we could allow privilege escalation but continue to disallow it at the same time? Virtualization (or emulation) allows running a computer as a regular task with no special privileges. Isolation for security purposes is one of the core benefits of virtualization, so using it in this way aligns with the purpose of the tool.
The Kubernetes pod, for example, continues to have all restrictions applied to it, the pod runs QEMU, and the virtual machine runs Linux which builds the docker image. All of the network/filesystem/other restrictions that apply to the pod continue to apply within the virtual machine. The virtual machine is just another program, running as the user of the pod just like anything else, so it cannot do anything that the pod cannot do. Once the virtual machine builds the docker image, it can be exported as a file or pushed to a docker registry.
To implement this approach, a docker image containing QEMU and a VM image must be built, and the VM image must contain buildkit (or buildah, as preferred) and the tools necessary to run it.
Building the Docker Image
This implementation builds on qemu-docker, which already does the tedious work of providing, configuring, and maintaining QEMU within a docker image. It also relies on Alpine to produce a small, maintainable Linux kernel and initramfs for the VM.
QEMU is configured to use a microvm (inspired by Amazon’s Firecracker, which powers Lambda) and virtio devices for minimal start up time and maximum performance. 9p is used for sharing files (such as the Dockerfile to be built and the files it references) between the VM and host.
Note that best practices are followed for these scripts. Linters, which are key to secure, maintainable, quality DevSecOps, are run, including shellcheck and Docker Build Checks. Docker image digests are used.
Save these files in a new directory.
init.sh: Runs as init inside the QEMU virtual machine.
#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail
export "PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin"
mount cgroup /sys/fs/cgroup -t cgroup
mount -o remount,rw,noatime /
mkdir -p /app
mount -t 9p -o trans=virtio,nodev,msize=5000000,cache=loose app /app
mkdir -p /variables
mount -t 9p -o trans=virtio,nodev,msize=5000000,cache=loose variables /variables
on_exit()
{
    status=$?
    echo "$status" > /variables/exit_code
    poweroff -f
}
trap on_exit EXIT
echo /dev /dev/shm /dev/pts /proc /sys /sys/fs/cgroup /run | xargs -n1 | xargs -I{} sh -c 'mkdir -p /buildkit{} && mount -o bind,ro,nosuid,nodev,noexec {} /buildkit{}'
# When https://gitlab.alpinelinux.org/alpine/mkinitfs/-/merge_requests/184 is merged and available,
# Modify entrypoint.sh (see comments in that file)
# and remove this block
ifconfig eth0 10.0.2.15 netmask 255.255.255.0
ip route add 0.0.0.0/0 via 10.0.2.2 dev eth0
echo "nameserver 10.0.2.3" > /etc/resolv.conf
/build-image.sh
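entrypoint.sh: Runs in the container, validates the required environment variables, and starts the QEMU virtual machine.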
#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail
if [ -z "${CI_PROJECT_DIR:-}" ]; then
    >&2 echo "CI_PROJECT_DIR must be set to the directory containing the files necessary to build the docker image."
    exit 1
fi
if [ ! -d "${CI_PROJECT_DIR}" ]; then
    >&2 echo "CI_PROJECT_DIR is not set correctly. The directory ${CI_PROJECT_DIR} does not exist."
    exit 1
fi
if [ -z "${DOCKER_AUTH_CONFIG:-}" ]; then
    >&2 echo "DOCKER_AUTH_CONFIG must be set. See https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#determine-your-docker_auth_config-data"
    exit 1
fi
if [ -z "${CI_JOB_URL:-}" ]; then
    >&2 echo "CI_JOB_URL must be set."
    exit 1
fi
if [ -z "${CI_PROJECT_URL:-}" ]; then
    >&2 echo "CI_PROJECT_URL must be set."
    exit 1
fi
if [ -z "${DOCKER_FILE:-}" ]; then
    >&2 echo "DOCKER_FILE must be an absolute path or path relative to ${CI_PROJECT_DIR} for the Dockerfile."
    exit 1
fi
DOCKER_FILE="$(realpath -s --relative-to="${CI_PROJECT_DIR}" "${DOCKER_FILE}")"
if [ ! -f "${DOCKER_FILE}" ]; then
    >&2 echo "DOCKER_FILE is not set correctly. The file ${DOCKER_FILE} does not exist."
    exit 1
fi
if [ -z "${DOCKER_IMAGE:-}" ]; then
    >&2 echo "DOCKER_IMAGE must be set to the fully qualified name (including registry, name, and tag) where the image will be pushed."
    exit 1
fi
if [ -z "${BUILD_CONTEXT:-}" ]; then
    >&2 echo "BUILD_CONTEXT must be an absolute path or path relative to ${CI_PROJECT_DIR} of the directory to use as the build context."
    exit 1
fi
BUILD_CONTEXT="$(realpath -s --relative-to="${CI_PROJECT_DIR}" "${BUILD_CONTEXT}")"
if [ ! -d "${BUILD_CONTEXT}" ]; then
    >&2 echo "BUILD_CONTEXT is not set correctly. The directory ${BUILD_CONTEXT} does not exist."
    exit 1
fi
variables_directory="$(mktemp -d -t vars.XXX)"
on_exit()
{
    if [ -f "${variables_directory}/exit_code" ]; then
        status="$(cat "${variables_directory}/exit_code")"
    else
        >&2 echo "Failed to run buildkit"
        status=1
    fi
    rm -rf "${variables_directory}"
    exit "$status"
}
trap on_exit EXIT
echo "${DOCKER_AUTH_CONFIG}" > "${variables_directory}/DOCKER_AUTH_CONFIG"
echo "${CI_JOB_URL}" > "${variables_directory}/CI_JOB_URL"
echo "${CI_PROJECT_URL}" > "${variables_directory}/CI_PROJECT_URL"
echo "${DOCKER_FILE}" > "${variables_directory}/DOCKER_FILE"
echo "${DOCKER_IMAGE}" > "${variables_directory}/DOCKER_IMAGE"
echo "${BUILD_CONTEXT}" > "${variables_directory}/BUILD_CONTEXT"
if [ -c /dev/kvm ]; then
    echo "Using KVM acceleration."
    kvm="yes"
    cpu="host"
else
    echo "Not using KVM acceleration. Performance will drastically suffer."
    echo "To use KVM acceleration, the /dev/kvm device must exist and be readable and writable by the current user."
    kvm=""
    # Software is starting to target x86-64-v2 as a baseline, not running if
    # that baseline is unmet.
    # x86-64-v3 is safe as almost all 2015 or later CPUs meet this baseline.
    # x86-64-v4 is not safe as it requires AVX512 which is not available in
    # 2022 and later Intel CPUs.
    cpu="Skylake-Client-v4"
fi
# When https://gitlab.alpinelinux.org/alpine/mkinitfs/-/merge_requests/184 is merged and available,
# Add 'ip=10.0.2.15::10.0.2.2:255.255.255.0:guest:eth0::10.0.2.3:10.0.2.3' to the -append argument
# and remove the block in the init file
qemu-system-x86_64 \
    ${kvm:+--enable-kvm} \
    --cpu "${cpu}" \
    -M microvm \
    -nographic \
    -no-user-config \
    -no-reboot \
    -m 1500 \
    -smp 2 \
    -kernel /var/lib/qemu-docker/vmlinuz-virt \
    -initrd /var/lib/qemu-docker/initramfs-virt \
    -append 'quiet earlyprintk=ttyS0 console=ttyS0 reboot=t panic=-1 root=/dev/vda rootfstype=ext4 modules=9p mitigations=off' \
    -drive id=buildkit,file=/var/lib/qemu-docker/buildkit.qcow2,format=qcow2,if=none,cache=unsafe \
    -device virtio-blk-device,drive=buildkit \
    -device virtio-rng-device \
    -fsdev "local,path=${CI_PROJECT_DIR},security_model=none,id=app,readonly=on" \
    -device virtio-9p-device,fsdev=app,mount_tag=app \
    -fsdev "local,path=${variables_directory},security_model=none,id=variables" \
    -device virtio-9p-device,fsdev=variables,mount_tag=variables \
    -netdev user,id=unet \
    -device virtio-net-device,netdev=unet \
    -net user \
    || :
Dockerfile:
# syntax=docker/dockerfile:1.13.0@sha256:426b85b823c113372f766a963f68cfd9cd4878e1bcc0fda58779127ee98a28eb
# check=error=true
FROM alpine:3.21.2@sha256:56fa17d2a7e7f168a043a2712e63aed1f8543aeafdcee47c58dcffe38ed51099 AS builder
# Size of the disk used in the VM.
# The docker image is built on this disk so the disk size limits the size of the image that can be built.
# If building docker images using this docker image fails due to running out of disk space, increase this value.
# Note that since qcow2 (which is a sparse format) is used as the disk image format, changing the disk size doesn't (significantly) change the size of this docker image.
ARG DISK_SIZE=100G
ARG BUILDKIT_IMAGE=moby/buildkit:v0.19.0@sha256:14aa1b4dd92ea0a4cd03a54d0c6079046ea98cd0c0ae6176bdd7036ba370cbbe
SHELL ["/bin/sh", "-euo", "pipefail", "-c"]
WORKDIR /build
COPY build-image.sh init.sh /build/
RUN <<EOF
apk add --no-cache crane curl e2fsprogs e2tools qemu-img linux-virt
# Build initramfs
echo 'features="9p base ext4 kms network virtio"' > /etc/mkinitfs/mkinitfs.conf
kernel="$(basename $(find "/lib/modules/"* -maxdepth 0))"
/sbin/mkinitfs "$kernel"
# get the buildkit docker image and convert it to a disk image
mkdir /buildkit
crane export --platform "linux/amd64" "${BUILDKIT_IMAGE}" - | tar -C /buildkit -xf -
cp build-image.sh /buildkit
cp init.sh /buildkit/sbin/init
mke2fs -t ext4 -d /buildkit buildkit.raw "$DISK_SIZE"
qemu-img convert -f raw buildkit.raw -O qcow2 buildkit.qcow2
rm buildkit.raw
EOF
FROM qemux/qemu-docker:6.13@sha256:33c41464ef2bc8207d23d254d0c00c9e8da9b50800fc6e84554191b10402088c
SHELL ["/bin/sh", "-euo", "pipefail", "-c"]
WORKDIR /var/lib/qemu-docker/
COPY --from=builder --chmod=0666 /boot/vmlinuz-virt /boot/initramfs-virt /build/buildkit.qcow2 /var/lib/qemu-docker/
COPY --chmod=0555 entrypoint.sh /var/lib/qemu-docker/
RUN useradd --uid 1005 --create-home --shell /usr/sbin/nologin qemu
USER 1005
ENTRYPOINT ["/usr/bin/tini", "-s", "/var/lib/qemu-docker/entrypoint.sh"]
build-image.sh: Runs inside the virtual machine, where it builds the image with buildkit and pushes it.
#!/bin/sh
# shellcheck shell=busybox
set -euo pipefail
CI_JOB_URL="$(cat /variables/CI_JOB_URL)"
CI_PROJECT_URL="$(cat /variables/CI_PROJECT_URL)"
DOCKER_FILE="$(cat /variables/DOCKER_FILE)"
DOCKER_IMAGE="$(cat /variables/DOCKER_IMAGE)"
BUILD_CONTEXT="$(cat /variables/BUILD_CONTEXT)"
mkdir -p "${HOME}/.docker"
cp "/variables/DOCKER_AUTH_CONFIG" "${HOME}/.docker/config.json"
cd /app
SOURCE_DATE_EPOCH=1 buildctl-daemonless.sh build \
    --progress=plain \
    --output rewrite-timestamp=true,type=image,oci-mediatypes=false,push=true,name="$DOCKER_IMAGE" \
    --frontend dockerfile.v0 \
    --local context="${BUILD_CONTEXT}" \
    --local dockerfile=. \
    --opt filename="${DOCKER_FILE}" \
    --opt attest:provenance=mode=max,builder-id="${CI_JOB_URL}" \
    --opt label:org.opencontainers.image.source="${CI_PROJECT_URL}"
Build the docker image:
docker build -t qemu-buildkit .
The image can be built by itself (dogfooding).
You can push the image to a docker registry for use with a Kubernetes cluster, GitLab job, or anything else.
Using the Docker Image to Build a Dockerfile without Root or Privilege Escalation
The quickest and easiest way to test this out is to cd to a directory containing a Dockerfile, then run:
docker run -it -e CI_JOB_URL=http://ci.example.com/job1234 -e CI_PROJECT_URL=http://myproject.example.com -e BUILD_CONTEXT=. -e DOCKER_FILE=Dockerfile -e DOCKER_IMAGE=docker.io/example/myimage -e DOCKER_AUTH_CONFIG='{"auths":{"docker.io":{"auth":"base64ofusernamecolonpassword"}}}' -e CI_PROJECT_DIR=/app -w /app -v "$(pwd):/app:Z" qemu-buildkit
There are a number of required environment variables:
- CI_JOB_URL: Used as the builder id in the SLSA provenance attached to the built image.
- CI_PROJECT_URL: Set as the org.opencontainers.image.source label on the built image.
- CI_PROJECT_DIR: Path within the docker container to the directory that contains the Dockerfile to be built.
- BUILD_CONTEXT: Directory to set as the build context when building the docker image. Usually set to . (the current directory).
- DOCKER_FILE: Name of the Dockerfile to build. Usually set to Dockerfile. This is an absolute path or a path resolved relative to CI_PROJECT_DIR.
- DOCKER_IMAGE: Fully qualified name (including registry, name, and tag) where the image will be pushed.
- DOCKER_AUTH_CONFIG: JSON containing the credentials used to access docker registries. Must contain the credentials used to push the image to the name provided in DOCKER_IMAGE. GitLab provides documentation for how to set up DOCKER_AUTH_CONFIG.
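The auth value in DOCKER_AUTH_CONFIG is the base64 encoding of username:password. As a small sketch, the JSON for a single registry can be generated as follows. The function name docker_auth_config and the example credentials are hypothetical, and the sketch assumes the registry name and credentials contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Print DOCKER_AUTH_CONFIG JSON for one registry from a username and password.
docker_auth_config() {
    registry="$1"
    username="$2"
    password="$3"
    # The auth field is base64("username:password"); strip any line wrapping
    # that base64 adds for long inputs.
    auth="$(printf '%s:%s' "$username" "$password" | base64 | tr -d '\n')"
    printf '{"auths":{"%s":{"auth":"%s"}}}\n' "$registry" "$auth"
}

docker_auth_config docker.io exampleuser examplepassword
```

In practice the credentials should come from a secret store or CI variable rather than being typed on the command line, where they may end up in shell history.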
Modify the scripts as necessary to best suit your specific needs and requirements.
The image is now ready to be used in Kubernetes, GitLab jobs, GitHub Actions, Jenkins, and anywhere else.
Improving Performance with KVM
You may notice the message “Not using KVM acceleration. Performance will drastically suffer.” in the job output. When building simple, small docker images, the slowdown isn’t terrible, adding just a few seconds compared to a normal build. For larger images or images that involve more work to build, though, the slowdown is drastic, on the order of 10x or more time required. Fortunately, this performance issue can (usually) be addressed.
Even if the slowdown is significant, this approach at least allows for building images, a task that is otherwise impossible under these conditions.
KVM (Kernel-based Virtual Machine) provides hardware virtualization on Linux, enabling the use of the CPU’s virtualization instructions along with other features. When KVM is unavailable, QEMU falls back to full emulation. Emulation means every CPU instruction is run in software, whereas with virtualization most CPU instructions run natively and only special or unsafe ones are intercepted and handled by QEMU. Virtualization has far less overhead than emulation, with virtualized workloads running at essentially native speed.
Virtualization, specifically Linux KVM, is the foundation of the cloud; for example, AWS runs on KVM. KVM is widely used, secure, and well-supported.
On Kubernetes, Generic Device Plugin is a common way to expose devices, such as KVM, to pods. Only /dev/kvm needs to be exposed.
For GitLab Kubernetes executors, the GitLab runner configuration has to be altered.
To expose KVM to the container using docker run, pass --device /dev/kvm. For example, the test command provided previously, altered to use KVM, is:
docker run -it -e CI_JOB_URL=http://ci.example.com/job1234 -e CI_PROJECT_URL=http://myproject.example.com -e BUILD_CONTEXT=. -e DOCKER_FILE=Dockerfile -e DOCKER_IMAGE=docker.io/example/myimage -e DOCKER_AUTH_CONFIG='{"auths":{"docker.io":{"auth":"base64ofusernamecolonpassword"}}}' -e CI_PROJECT_DIR=/app -w /app -v "$(pwd):/app:Z" --device /dev/kvm qemu-buildkit
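If the “Not using KVM acceleration” message still appears, or QEMU fails to start with KVM, it can help to confirm from inside the container that the device is present and accessible. The following is a sketch: kvm_usable is a hypothetical helper that goes slightly beyond the check in entrypoint.sh, which only tests that /dev/kvm exists as a character device.

```shell
#!/bin/sh
# Succeed only if the given path (default /dev/kvm) is a character device
# that the current user can both read and write.
kvm_usable() {
    device="${1:-/dev/kvm}"
    [ -c "$device" ] && [ -r "$device" ] && [ -w "$device" ]
}

if kvm_usable; then
    echo "KVM is available to this user."
else
    echo "KVM is unavailable: /dev/kvm is missing or not readable and writable."
fi
```

When the device exists but is not writable, the usual cause is that the container user is not in the group that owns /dev/kvm on the host.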
Now enjoy performant, secure Dockerfile builds.
Building Docker Images Without Root or Privilege Escalation by Craig Andrews is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.