Snorre.io

Demystifying Docker – A cursory look beneath the magic

Containers might feel like lightweight VMs, but under the hood they’re something else entirely. Images, containers, volumes, networks, and more are all built on top of fundamental Linux concepts. This post attempts to demystify these concepts and reveal what is actually going on under the hood.

Containers vs VMs

If your frame of reference is virtual machines, it’s tempting to think of containers the same way—but they’re not. The diagram below illustrates the difference between the two concepts.

Virtual machines:

  Physical hardware (CPU, RAM, disk, NIC)
  → Host OS / kernel
  → Hypervisor (virtualises hardware for each guest)
  → VM 1, VM 2, VM 3: each with its own virtual hardware, OS/kernel, and processes

Containers (Docker):

  Physical hardware (CPU, RAM, disk, NIC)
  → Host OS / single shared kernel
  → Docker (daemon, image layers, runtime)
  → Container A, B, C: each just process(es) with filesystem and network isolation

Left: each VM gets emulated hardware and a full guest OS. Right: one kernel; containers are “jailed” processes with isolated filesystem, namespaces, and cgroups—no separate OS.

Virtual machines

Physical hardware runs a host kernel and OS, plus a hypervisor that emulates hardware for each guest VM. Each VM runs its own kernel and OS and is isolated from the host and from other VMs. From the guest OS’s perspective it’s running on hardware; it doesn’t know about the hypervisor or the host unless the hypervisor is configured to expose the host machine.

Containers

A container is just a regular Linux process (possibly with child processes) running on the host kernel, sharing the same OS. Isolation comes from kernel primitives like namespaces, cgroups, chroot, and capabilities:

  • Namespaces — pid (own process tree only), mount (own view of the filesystem), user (e.g. root-in-container → unprivileged on host), network (own interfaces, no visibility of host/others).
  • cgroups — limits and accounting for CPU, memory, disk I/O, and network.
  • chroot — the process’s root is a chosen directory; it cannot see or reach paths outside that subtree.
  • Capabilities — fine-grained privileges instead of full root.

Together these form a sandbox. The process is separated from the rest of the host without a separate kernel or OS.
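These primitives are visible on any Linux host. A quick way to poke at them (assuming a Linux machine with a standard procfs):

```shell
# Every process has handles to its namespaces under /proc/<pid>/ns.
# Two processes in the same namespace share the same inode numbers here;
# a containerized process gets different inodes for pid, mnt, net, etc.
ls -l /proc/self/ns

# cgroup membership is also per-process and visible in procfs.
cat /proc/self/cgroup
```

Compare `ls -l /proc/self/ns` on the host with the same command inside a container and you will see different namespace inodes for pid, mnt, and net.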

Linux filesystem and permissions (primer)

Before we look at chroot and volume mounts, here is a quick primer on how Linux organizes files.

Paths start at / (root). Each file has permissions (owner/group/others) and ownership. Docker uses this when setting up container roots and volume mounts.

File tree with permissions (d rwx r-x r-x = type, owner, group, others):

  /      drwxr-xr-x  root:root            root of filesystem
  home/  drwxr-xr-x  root:root            user home dirs
  etc/   drwxr-xr-x  root:root            config files
  var/   drwxr-xr-x  root:root            variable data (logs, caches)
  tmp/   drwxrwxrwx  root:root            temporary, world-writable
  app/   drwxr-xr-x  myappuser:myappuser  your app (common in containers)

r = read, w = write, x = execute (on dirs: enter).
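A small sketch of these permission bits in practice (the paths are throwaway examples):

```shell
# Create two directories and inspect their modes and ownership.
mkdir -p /tmp/perm-demo/app /tmp/perm-demo/shared

chmod 755 /tmp/perm-demo/app            # rwx for owner, r-x for group and others
stat -c '%A %U:%G' /tmp/perm-demo/app   # e.g. drwxr-xr-x youruser:yourgroup

chmod 1777 /tmp/perm-demo/shared        # world-writable plus sticky bit, like /tmp
stat -c '%a' /tmp/perm-demo/shared      # prints 1777
```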

Linux mounts (primer)

The single directory tree is built from many filesystems attached at mount points. The kernel keeps a mount table: what is mounted where. For example, the root directory / might be mounted from a filesystem on disk A while /home is mounted from a partition on disk B. /etc (on the root filesystem) and /home then sit next to each other in the tree, but actually live on different filesystems on different disks.

Mount points attach filesystems into the tree; the kernel's mount table records what is mounted where. Bind mount = same filesystem at another path.

Mount table (conceptual): path ← mounted from (source)

  /            ← rootfs (e.g. ext4 on sda1)
  /home        ← ext4 on sda2
  /mnt/backup  ← ext4 on sdb1
  /mnt/data    ← bind from /data (same filesystem, another path)
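You can read the kernel's mount table directly from procfs (fields per line: source, mount point, filesystem type, options):

```shell
# Each line of /proc/self/mounts is one entry in the mount table.
# Print "mount point <- source (fstype)" for the first few entries.
awk '{print $2, "<-", $1, "(" $3 ")"}' /proc/self/mounts | head -n 5
```

On most distributions, `findmnt` presents the same table as a tree.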

chroot — setting a new root directory for a process

The process’s “/” is a chosen directory on the host. It cannot see or reach anything above it.

Host filesystem
/
├── home/
├── etc/
├── var/
│   └── lib/
│       └── docker/
│           └── containers/
│               └── abc123…/  ← container root
│                   ├── app/
│                   ├── etc/
│                   └── bin/
What the process sees (inside container)
/  ← container root
├── app/
├── etc/
└── bin/

Paths like /etc or /app refer only to files under the container root.

Mount namespace — isolated mount table

The container has its own mount table: it only sees the mounts Docker set up (overlay for /, plus any volume mounts). It cannot see or unmount the host’s mounts. Volume mapping is done by adding entries to this isolated table.

Host mount table
/ — rootfs
/home — disk partition
/var/lib/docker/... — volumes, overlay
Container mount table (isolated)
/ — overlay (image + RW layer)
/app/data — volume (bind or named)

PID namespace — isolated process tree

The container only sees its own processes. PID 1 inside the container is a different process on the host; host PIDs are invisible.

Host view
PID 1 — init
PID 100 — systemd
PID 200 — sshd
PID 5000 — dockerd
PID 5001 — app (container)
PID 5002 — worker
PID 5003 — worker
Container view (same processes)
PID 1 — app
PID 2 — worker
PID 3 — worker

Only these processes exist from the container’s perspective. Host PIDs 1, 100, 5000 are invisible. This is why you can’t see host processes or kill host processes from inside a container.
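You can reproduce this with unshare from util-linux, assuming a kernel that allows unprivileged user namespaces (otherwise run it as root and drop the --user/--map-root-user flags):

```shell
# New user + PID namespaces. --fork is needed so the shell is created
# inside the new PID namespace and becomes PID 1 there; --map-root-user
# maps our UID to 0 inside the user namespace.
unshare --user --map-root-user --pid --fork sh -c 'echo "my pid: $$"'
# prints: my pid: 1
```

On the host, that same shell has an ordinary PID in the thousands; only inside the namespace is it PID 1.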

Docker images

Imagine you run the following command to build a new image from a Dockerfile:

docker build -t my-image:1.0.0 .

Docker creates a new image with the name my-image and tags it with the version 1.0.0. It follows your Dockerfile, adding the files and folders you’ve specified and running the commands you’ve listed. A Docker image is not like a VM image; it is not a single filesystem snapshot with all your files and folders. Instead it is made up of layers. Dockerfile instructions like RUN, COPY, and ADD create new layers. Each layer is a read-only archive that contains the files added, modified, or deleted since the previous layer. Other instructions like ENV, WORKDIR, and CMD do not create new layers; instead they add metadata to the image.

  FROM alpine:3.19
  RUN apk add --no-cache nodejs
  ENV NODE_ENV=production
  WORKDIR /app
  COPY package*.json ./
  RUN npm ci --omit=dev
  COPY . .
  CMD ["node", "server.js"]
Only RUN, COPY, and ADD create new layers (FROM pulls in the base image's layers). ENV, WORKDIR, CMD, etc. set metadata only.

After running the docker build command you’ve now got an image entry that points at the last layer in the image, sha256:000080bf7f36. That layer points to the second last layer in the image (sha256:00006e913295), and so on until it reaches the first layer in the image (sha256:00007bedb7f3).

Because each layer is immutable and is identified by its instruction input, Docker can efficiently reuse layers when building images. It is important not to confuse this with reproducible builds, which is a different concept: Docker does not guarantee that the same Dockerfile instruction will always produce the same output; that is up to the author of the Dockerfile. Adding a single COPY or CMD (or changing one line) means every layer before that line in the image is unchanged. This allows Docker to reuse the existing layer archive files for those layers and only rebuild the changed layer and every layer after it.

  FROM alpine:3.19               (cached)
  RUN apk add --no-cache nodejs  (cached)
  ENV NODE_ENV=production        (cached)
  WORKDIR /src                   (rebuilt)
  COPY package*.json ./          (rebuilt)
  RUN npm ci --omit=dev          (rebuilt)
  COPY . .                       (rebuilt)
  CMD ["node", "server.js"]      (rebuilt)

Changing WORKDIR from /app to /src invalidates the cache at that line: every instruction before it is a cache hit; that instruction and everything after it is rebuilt.
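The content-addressing idea can be sketched with plain tar and sha256sum: a layer is a tar archive, and its digest identifies it. The GNU tar flags below normalize file ordering, timestamps, and ownership so identical content produces an identical archive and therefore an identical digest (real layer digests are likewise hashes over the layer tarball, though Docker's exact normalization differs):

```shell
mkdir -p /tmp/layer-demo/app
echo 'console.log("hi")' > /tmp/layer-demo/app/index.js
cd /tmp/layer-demo

# Normalize metadata so the archive bytes depend only on file content.
tar --sort=name --mtime='2020-01-01 00:00:00' --owner=0 --group=0 \
    -cf layer.tar app
sha256sum layer.tar   # same content in, same digest out, every time
```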

FROM instruction

The FROM instruction is special in that it essentially links your first image layer to a previous image’s layers. This way you can easily build on top of existing images like FROM ubuntu:20.04 or FROM node:18-alpine. This is also what enables multi-stage builds, where you define multiple images in a single Dockerfile and later stages copy artifacts from earlier ones.

  FROM alpine:3.19
  COPY package*.json ./
  RUN npm ci

FROM pulls in the base image's layers, then COPY and RUN add layers on top.

FROM doesn't add one layer—it attaches your image to the base image's entire layer stack. Your RUN/COPY/ADD layers sit on top.
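A multi-stage Dockerfile might look like this (a hypothetical Node project where `npm run build` emits `dist/`; the names are illustrative):

```dockerfile
# Stage 1: build with the full toolchain.
FROM node:18-alpine AS build
WORKDIR /src
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: the runtime image starts from a fresh base and copies only
# what it needs; the toolchain layers stay behind in the build stage.
FROM node:18-alpine
WORKDIR /app
COPY --from=build /src/dist ./dist
COPY --from=build /src/node_modules ./node_modules
CMD ["node", "dist/server.js"]
```

Only the final stage's layers (plus its base image's layers) end up in the published image.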

Starting a container from an image

First, the image’s read-only layers are merged into a single coherent read-only filesystem by stacking them on top of each other (e.g. via overlayfs).

  sha256:00004f9b1c93  24 MB
  sha256:0000b1069795  34 MB
  sha256:00002746267f  13 MB
  sha256:00007bedb7f3  37 MB
Each layer (bottom to top) adds or changes files. The union is the read-only filesystem for the container.
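The union semantics can be imitated with plain copies (a deliberately naive simulation; real overlayfs is a kernel filesystem, mounted with `mount -t overlay` and lowerdir/upperdir options, that presents the union lazily instead of copying):

```shell
mkdir -p /tmp/ov/lower /tmp/ov/upper /tmp/ov/merged
echo 'from lower' > /tmp/ov/lower/a.txt
echo 'from lower' > /tmp/ov/lower/b.txt
echo 'from upper' > /tmp/ov/upper/b.txt   # upper layer shadows lower b.txt

# "Mount" the union: apply lower first, then upper on top.
cp -r /tmp/ov/lower/. /tmp/ov/merged/
cp -r /tmp/ov/upper/. /tmp/ov/merged/

cat /tmp/ov/merged/a.txt   # prints: from lower
cat /tmp/ov/merged/b.txt   # prints: from upper
```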

When you run a container, Docker then creates a thin read-write layer on top of that, sets up namespaces and cgroups, and runs the entrypoint. The container is that writable layer plus the image layers—ephemeral unless you commit or use volumes.

Image filesystem (read-only):

  /
  ├── app/
  │   ├── index.js
  │   └── package.json
  ├── etc/
  │   └── config.json
  └── usr/
      └── bin/
          └── node

Container filesystem (read+write union): the same tree, plus a thin writable layer on top. Files the running container creates or modifies land in the writable layer; stopping the container freezes it, and removing the container discards it, returning to the image-only view.

Volume mapping (bind mounts and named volumes)

To persist or share data, you mount a path from the host (or a Docker-managed volume) into the container. Bind mounts and named volumes are implemented by adding mounts in the container’s mount namespace. With -v /host/path:/container/path, the host directory appears in the container’s mount table at the chosen path; writes in the container go directly to the host path. With -v myvol:/container/path, Docker creates a directory under e.g. /var/lib/docker/volumes/ and mounts it; the data survives docker rm and can be shared between containers.

  docker run --rm my-image

  No volume: the container writes /app/data/config.json and /app/data/state.json to its R/W layer only.

Without a volume, writes in the container are only in the R/W layer and disappear when the container is removed.
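The bind-mount mechanism itself can be tried without Docker. Inside a new user + mount namespace an unprivileged user may create bind mounts, which is essentially the entry that -v /host/path:/container/path adds to the container's mount table (assumes unprivileged user namespaces are enabled):

```shell
mkdir -p /tmp/vol-demo/hostdata /tmp/vol-demo/mnt
echo 'persisted' > /tmp/vol-demo/hostdata/state.txt

# In a private mount namespace, bind the "host" directory to another
# path; the mount exists only in this namespace's mount table.
unshare --user --map-root-user --mount sh -c '
  mount --bind /tmp/vol-demo/hostdata /tmp/vol-demo/mnt
  cat /tmp/vol-demo/mnt/state.txt
'
# prints: persisted
```

After the command exits, /tmp/vol-demo/mnt is empty again on the host: the bind mount lived only in that namespace's mount table, just as a container's volume mounts do.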

User mapping

When a container writes to a bind-mounted host path, files are created with the same UID/GID the process has—and by default that UID is the same inside and outside the container.

Host user list (examples)

On the host, each user has a numeric UID. The kernel uses these IDs for file ownership and process privileges. Containers share the same kernel, so the same UID list applies unless you use user namespaces.

  UID 0     root
  UID 1     daemon
  UID 1000  alice
  UID 1001  bob

No user mapping (default)

By default, Docker does not use a user namespace. The container process runs with the same numeric UID on the host as inside the container. What matters is the number: root (0) in the container is root on the host; UID 1000 in the container corresponds to host UID 1000 (e.g. alice). The username inside the container (from the image’s /etc/passwd) does not have to match the host—same UID, different names.

  Container runs as root (UID 0)       → host sees: root (0)
  Container runs as appuser (UID 1000) → host sees: alice (1000)

Same UID; names can differ (container “appuser” ≠ host “alice”).

With user mapping (user namespace)

You can enable user namespace remapping (e.g. --userns-remap=alice in the daemon, or /etc/subuid and /etc/subgid). Container UIDs are then mapped into a range on the host (e.g. 100000–165535). Root inside the container is no longer root on the host.

Example mapping (e.g. alice:100000:65536 in /etc/subuid):

  Container UID 0    → Host UID 100000
  Container UID 1    → Host UID 100001
  Container UID 1000 → Host UID 101000

  Container runs as root (UID 0) → host sees: 100000 (unprivileged)
  Container runs as UID 1000     → host sees: 101000
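The mapping lives in /proc/&lt;pid&gt;/uid_map and can be inspected with unshare (again assuming unprivileged user namespaces):

```shell
# Inside a new user namespace our real UID is mapped to 0, so id -u
# reports root even though we are unprivileged on the host.
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
# id -u prints 0; uid_map prints a line like "0 1000 1", meaning:
# inside-UID 0 maps to host UID 1000, for a range of 1 UID.
```

Docker's --userns-remap uses the same uid_map mechanism, just with a large range (e.g. 100000–165535) instead of a single UID.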

Docker networking

Network isolation for containers comes from the network namespace: the container gets its own view of the network stack. We start with what that stack looks like on Linux, then how Docker uses it—port mapping, container-to-container, and host mode.

The Linux network stack (primer)

A typical Linux host has physical interfaces (e.g. eth0), a loopback device (lo), and the kernel’s networking stack (routing, iptables/nftables, etc.). Docker’s networking builds on these same building blocks.

Linux network stack — how the pieces fit together

A packet flows through the stack. Incoming: interface → routing → iptables → application. Outgoing: application → DNS for hostnames (e.g. “postgres” → IP) → routing → iptables/nftables (NAT, filter) → interface (e.g. eth0 192.168.1.10, or lo 127.0.0.1 for local traffic). At the application end, the stack terminates in a socket bound to a port (e.g. :8080).

Network namespace — own interfaces only

The container has its own network stack: loopback and a virtual interface (e.g. veth) paired with the host. It cannot see the host’s physical NICs or other containers’ interfaces.

Host
eth0 (physical)
docker0 (bridge)
veth0abc ←→ container's eth0
Container
lo (127.0.0.1)
eth0 (e.g. 172.17.0.2)

Only these interfaces exist. No visibility of host eth0 or other containers.
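This is easy to verify with unshare and iproute2 (assumes unprivileged user namespaces and the ip tool):

```shell
# A brand-new network namespace contains only a loopback device,
# and even that starts out down. No host eth0, no docker0, nothing.
unshare --user --map-root-user --net ip -o link show
# prints a single line, describing lo
```

Docker then creates a veth pair, moves one end into the namespace as eth0, and attaches the other end to the bridge on the host.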

Port mapping (host :8080 → container :8080)

When you publish a port with -p 8080:8080, the host listens on 8080 and uses iptables/nftables DNAT to rewrite the destination to the container’s IP and port. The packet is then forwarded via docker0 and the veth pair into the container.

  1. Packet arrives on host eth0 with destination host:8080.
  2. iptables DNAT rewrites the destination: host:8080 → 172.17.0.2:8080.
  3. The packet is forwarded via docker0 and the veth pair.
  4. It arrives on the container's eth0 (172.17.0.2:8080).
  5. It is delivered to the app listening on :8080.

Networking between containers

When two containers are on the same user-defined network (e.g. docker network create mynet), they each get a veth pair onto the same bridge on the host. The bridge is outside the containers—it lives in the host network namespace. Packets between containers go from one container’s eth0 through its veth to the bridge, then through the other veth to the other container’s eth0; no NAT, just bridge forwarding. On user-defined networks, Docker runs an embedded DNS so containers can reach each other by name (e.g. “ping app” resolves “app” to that container’s IP).

Layout: Container A (Node server) has its own network namespace with lo (127.0.0.11, Docker DNS) and eth0 (172.18.0.2); Container B (Postgres) likewise has its own lo and eth0. The veth pairs and the bridge live on the host and connect the two.

Host network mode

With --network host, the container does not get its own network namespace. It uses the host’s network stack directly—no veth pair, no bridge, no NAT. The container sees the host’s eth0 and lo; if it listens on port 8080, it binds to the host’s 8080. Port mapping (-p) is ignored. This mode is useful when you need maximum performance or direct access to host network interfaces (e.g. for certain monitoring or high-throughput workloads), but it removes network isolation.

Bridge mode packet path (7 hops): client → host eth0 → iptables → docker0 → veth → container eth0 → app.
Host mode packet path (3 hops): client → host eth0 → app.
Host mode reaches the app in fewer steps: no iptables DNAT, no bridge, no veth pair.

Why Docker? The daemon and the engine

If containers are “just” processes plus namespaces, cgroups, and chroot, why do we need Docker at all? You could in principle create a container by hand: use unshare to create new namespaces, chroot into a rootfs, configure cgroups, and run your process. The reason Docker exists is that it orchestrates all of these primitives and adds a consistent workflow around images, networking, and storage.
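To make that concrete, here is a minimal “container by hand”, a sketch using only unshare: no image, no volumes, no cgroup limits, no chroot (assumes unprivileged user namespaces are enabled):

```shell
# New user, PID, mount, and UTS namespaces. --mount-proc remounts /proc
# so process listings only show this PID namespace; --fork makes the
# shell PID 1 inside it; --map-root-user maps our UID to 0.
unshare --user --map-root-user --pid --fork --mount-proc --uts sh -c '
  hostname minibox        # UTS namespace: our own hostname
  echo "pid: $$  uid: $(id -u)  host: $(hostname)"
  ps -e                   # only processes in this namespace are listed
'
# first line printed: pid: 1  uid: 0  host: minibox
```

Everything Docker adds on top of this (rootfs from image layers, veth networking, cgroup limits, lifecycle management) is orchestration of the same primitives.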

When you run docker run or docker build, the docker command-line tool does not perform the work itself. It talks to the Docker daemon (often called the Docker engine)—a long-running process (e.g. dockerd) that runs on the host, usually with root privileges. The daemon is responsible for:

  • Images and layers — Storing and resolving image manifests and layer blobs, building images from Dockerfiles (running each instruction in a temporary container, committing layers), and managing the content-addressable layer store (e.g. under /var/lib/docker).
  • Containers — Creating the container’s filesystem (e.g. overlay merge of image layers plus a writable layer), creating the namespaces (pid, mount, network, user, etc.), applying cgroup limits, setting up the rootfs and mounts, and starting the entrypoint process. On shutdown, it cleans up namespaces, mounts, and the writable layer unless you use volumes or commit.
  • Networking — Creating and attaching virtual interfaces (veth pairs), bridges (e.g. docker0), and iptables/nftables rules for port publishing and container-to-container communication. It also runs an embedded DNS for user-defined networks so containers can resolve each other by name.
  • Volumes — Creating and mounting bind mounts or named volumes into the container’s mount namespace and managing volume lifecycle.

So the engine is the daemon plus the low-level plumbing (containerd, runc, etc., depending on your Docker version). The CLI is a client; the daemon is the single place that owns image and container state and ties all the Linux primitives together. Without it, you’d be manually creating namespaces, mounting overlays, and wiring cgroups and networks every time. Docker gives you a single API, a portable image format, layer caching, and declarative Dockerfiles so you don’t have to script those steps yourself.

Alternatives

Because containers are just processes plus Linux primitives, the same workflow — images, layers, run, build — can be implemented by different tools. Podman is a prominent alternative. It is daemonless: there is no long-running podman process. When you run podman run or podman build, the Podman CLI (or its subprocesses) sets up namespaces, overlay mounts, and cgroups directly and then exits or hands the container to a short-lived conmon process for monitoring. That avoids a single root-owned daemon and can make it easier to run containers as an unprivileged user (rootless mode). Podman is CLI-compatible with Docker for many commands (podman run accepts the same flags as docker run), and it uses the same image format (OCI) and can pull from the same registries. So existing Docker images and Dockerfiles usually work as-is with Podman.

Summary

Containers are not lightweight VMs. They are processes on the host kernel, isolated by namespaces (pid, mount, user, network), cgroups, chroot, and capabilities. The Docker daemon (engine) is what ties it all together. The CLI talks to the daemon, which owns images, layers, container lifecycle, networking, and volumes. It orchestrates the Linux primitives so you get a single API and portable image format instead of scripting namespaces and mounts by hand. An image is a stack of read-only layers; a running container adds a thin writable layer and uses overlay (or similar) to present a single filesystem. Volumes and port mapping are just bind mounts and NAT rules in the container’s mount and network namespaces. Once you see Docker as a tidy wrapper around these Linux primitives, the “magic” becomes predictable: same kernel, same building blocks, just a different view of the system.