Containers might feel like lightweight VMs, but under the hood they’re something else entirely. Images, containers, volumes, networks and more are all built on top of fundamental Linux concepts. This post attempts to demystify those concepts and show what is actually going on.
Containers vs VMs
If your frame of reference is virtual machines, it’s tempting to think of containers the same way—but they’re not. The diagram below illustrates the difference between the two concepts.
Left: each VM gets emulated hardware and a full guest OS. Right: one kernel; containers are “jailed” processes with isolated filesystem, namespaces, and cgroups—no separate OS.
Virtual machines
Physical hardware runs a host kernel and OS, plus a hypervisor that emulates hardware for each guest VM. Each VM runs its own kernel and OS and is isolated from the host and from other VMs. From the guest OS’s perspective it’s running on hardware; it doesn’t know about the hypervisor or the host unless the hypervisor is configured to expose the host machine.
Containers
A container is just a regular Linux process (possibly with child processes) running on the host kernel, sharing the same OS. Isolation comes from kernel primitives like namespaces, cgroups, chroot, and capabilities:
- Namespaces — pid (own process tree only), mount (own view of the filesystem), user (e.g. root-in-container → unprivileged on host), network (own interfaces, no visibility of host/others).
- cgroups — limits and accounting for CPU, memory, disk I/O, and network.
- chroot — the process’s root is a chosen directory; it cannot see or reach paths outside that subtree.
- Capabilities — fine-grained privileges instead of full root.
Together these form a sandbox. The process is separated from the rest of the host without a separate kernel or OS.
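You can inspect these primitives on any Linux host without Docker; the kernel exposes a process’s namespaces and cgroup membership under /proc:

```shell
# Each symlink names one namespace this process belongs to; two processes
# in the same namespace show the same inode number in brackets.
ls -l /proc/self/ns

# cgroup membership for this process:
cat /proc/self/cgroup
```

A container’s entries differ from the host’s, which is the whole trick: same kernel, different namespace memberships.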
Linux filesystem and permissions (primer)
Before we look at chroot and volume mounts, here is a quick primer on how Linux organizes files.
Paths start at / (root). Each file has permissions (owner/group/others) and ownership. Docker uses this when setting up container roots and volume mounts.
(Diagram: a directory listing with example permissions and owners. Most directories are owned root:root with mode drwxr-xr-x, one is world-writable with drwxrwxrwx, and the application directory is owned myappuser:myappuser.)
Linux mounts (primer)
The single directory tree is built from many filesystems attached at mount points.
The kernel keeps a mount table: what is mounted where.
For example, the root directory / is mounted from a filesystem on disk A, while the home directory /home is mounted from a partition on disk B.
Both /etc and /home sit next to each other in the file tree, but /etc lives on the root filesystem on disk A while /home lives on a different filesystem on disk B.
Mount points attach filesystems into the tree; the kernel’s mount table records what is mounted where. A bind mount makes the same filesystem (or a subtree of it) visible at a second path.
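You can read the kernel’s mount table for the current process directly from /proc. The bind-mount command needs root and uses made-up example paths, so it is shown commented:

```shell
# Each line: device, mount point, filesystem type, options.
head -n 5 /proc/self/mounts

# A bind mount attaches an existing directory at a second path (root required;
# /var/data and /srv/data are illustrative paths):
# mount --bind /var/data /srv/data
```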
chroot — setting a new root directory for a process
The process’s “/” is a chosen directory on the host. It cannot see or reach anything above it.
/
├── home/
├── etc/
├── var/
│   └── lib/
│       └── docker/
│           └── containers/
│               └── abc123…/        ← container root
│                   ├── app/
│                   ├── etc/
│                   └── bin/
/                                   ← container root
├── app/
├── etc/
└── bin/
Paths like /etc or /app refer only to files under the container root.
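A minimal sketch of doing this by hand. Copying in a shell and the chroot itself need root (and assume a static busybox binary), so those lines are commented; /tmp/newroot is an arbitrary example path:

```shell
# Prepare a tiny root containing only a bin/ directory:
mkdir -p /tmp/newroot/bin
ls /tmp/newroot                           # prints: bin

# cp /bin/busybox /tmp/newroot/bin/sh     # assumes a static busybox binary
# chroot /tmp/newroot /bin/sh -c 'ls /'   # inside, "/" contains only bin
```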
Mount namespace — isolated mount table
The container has its own mount table: it only sees the mounts Docker set up (overlay for /, plus any volume mounts). It cannot see or unmount the host’s mounts. Volume mapping is done by adding entries to this isolated table.
PID namespace — isolated process tree
The container only sees its own processes. PID 1 inside the container is a different process on the host; host PIDs are invisible.
Only these processes exist from the container’s perspective; host PIDs 1, 100, 5000 are invisible. This is why you cannot see or signal host processes from inside a container.
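You can reproduce this without Docker using unshare; the command is commented because it needs root or unprivileged user namespaces, which not every host enables:

```shell
# unshare --user --map-root-user --pid --fork --mount-proc sh -c 'ps -o pid,comm'
# Typically lists just two processes: sh as PID 1 and ps as its child.

# On the host, by contrast, this shell has an ordinary PID:
echo $$
```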
Docker images
Imagine you run the following command to build a new image from a Dockerfile:
docker build -t my-image:1.0.0 .
Docker creates a new image with the name my-image and tags it with the version 1.0.0.
It follows your Dockerfile, adding the files and folders and running the commands you’ve specified.
A Docker image is not like a VM image: it is not a single filesystem snapshot with all your files and folders.
Instead it is made up of layers.
Dockerfile instructions like RUN, COPY, and ADD create new layers.
Each layer is a read-only archive that contains the files added, modified, or deleted since the previous layer.
Other instructions like ENV, WORKDIR, and CMD do not create new layers; instead they add metadata to the image.
FROM alpine:3.19
RUN apk add --no-cache nodejs
ENV NODE_ENV=production
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "server.js"]
Only FROM, RUN, COPY, and ADD create layers. ENV, WORKDIR, CMD, etc. set metadata only.
After running the docker build command you’ve now got an image entry that points at the last layer in the image, sha256:000080bf7f36.
That layer points to the second last layer in the image (sha256:00006e913295), and so on until it reaches the first layer in the image (sha256:00007bedb7f3).
Because each layer is immutable and identified by its instruction and input, Docker can efficiently reuse layers when building images. It is important not to confuse this with reproducible builds, which is a different concept: Docker does not guarantee that the same Dockerfile instruction will always produce the same output; that is up to the author of the Dockerfile. Changing (or adding) a single COPY or CMD line leaves every layer before that line unchanged, so Docker can reuse the existing layer archive files for those layers and only rebuild the changed layer and every layer after it.
FROM alpine:3.19               → cached
RUN apk add --no-cache nodejs  → cached
ENV NODE_ENV=production        → cached
WORKDIR /src                   → rebuilt
COPY package*.json ./          → rebuilt
RUN npm ci --omit=dev          → rebuilt
COPY . .                       → rebuilt
CMD ["node", "server.js"]      → rebuilt
Changing WORKDIR from /app to /src invalidates the cache at that line: instructions before it are cache hits; that instruction and everything after it are rebuilt.
FROM instruction
The FROM instruction is special in that it essentially links your first image layer to a previous image’s layers.
This way you can easily build on top of existing images like FROM ubuntu:20.04 or FROM node:18-alpine.
This is also what enables the idea of multi-stage builds, where you define multiple images in a single Dockerfile which inherit from each other.
(Diagram: the COPY package*.json ./ and RUN npm ci layers of your image stacked on top of the full alpine:3.19 layer stack.) FROM doesn't add one layer—it attaches your image to the base image's entire layer stack. Your RUN/COPY/ADD layers sit on top.
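A multi-stage build might look like the following sketch. The image names, paths, and the existence of an npm build script are assumptions for illustration; only the files copied with --from end up in the final image:

```dockerfile
# Stage 1: build with the full toolchain
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build            # assumes the project defines a build script

# Stage 2: a smaller runtime image that inherits only what we copy in
FROM node:18-alpine
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]
```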
Starting a container from an image
First, the image’s read-only layers are merged into a single coherent read-only filesystem, typically with an overlay filesystem that stacks the layers on top of each other.
Each layer (bottom to top) adds or changes files. The union is the read-only filesystem for the container.
When you run a container, Docker then creates a thin read-write layer on top of that, sets up namespaces and cgroups, and runs the entrypoint. The container is that writable layer plus the image layers—ephemeral unless you commit or use volumes.
Starting the container creates the writable layer; stopping it keeps the layer; removing the container discards it, leaving only the image layers.
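Overlay is an ordinary kernel filesystem you can experiment with by hand. The mount itself needs root so it is commented; the directory names are illustrative:

```shell
# Two read-only "image layers" plus an empty writable layer:
mkdir -p /tmp/ov/lower1 /tmp/ov/lower2 /tmp/ov/upper /tmp/ov/work /tmp/ov/merged
echo "from layer 1" > /tmp/ov/lower1/a.txt
echo "from layer 2" > /tmp/ov/lower2/b.txt

# mount -t overlay overlay \
#   -o lowerdir=/tmp/ov/lower2:/tmp/ov/lower1,upperdir=/tmp/ov/upper,workdir=/tmp/ov/work \
#   /tmp/ov/merged
# /tmp/ov/merged would then show a.txt and b.txt side by side, and any write
# through the merged view lands in upper/, leaving the lower layers untouched.
ls /tmp/ov/lower1 /tmp/ov/lower2
```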
Volume mapping (bind mounts and named volumes)
To persist or share data, you mount a path from the host (or a Docker-managed volume) into the container. Bind mounts and named volumes are implemented by adding mounts in the container’s mount namespace. With -v /host/path:/container/path, the host directory appears in the container’s mount table at the chosen path; writes in the container go directly to the host path. With -v myvol:/container/path, Docker creates a directory under e.g. /var/lib/docker/volumes/ and mounts it; the data survives docker rm and can be shared between containers. Use the animation below to step through no volume, bind mount, and named volume.
Without a volume, writes in the container are only in the R/W layer and disappear when the container is removed.
User mapping
When a container writes to a bind-mounted host path, files are created with the same UID/GID the process has—and by default that UID is the same inside and outside the container.
Host user list (examples)
On the host, each user has a numeric UID. The kernel uses these IDs for file ownership and process privileges. Containers share the same kernel, so the same UID list applies unless you use user namespaces.
No user mapping (default)
By default, Docker does not use a user namespace. The container process runs with the same numeric UID on the host as inside the container. What matters is the number: root (0) in the container is root on the host; UID 1000 in the container corresponds to host UID 1000 (e.g. alice). The username inside the container (from the image’s /etc/passwd) does not have to match the host—same UID, different names.
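You can check the host side of this yourself; the docker command is commented in case Docker is not installed:

```shell
# Your numeric UID on the host; this number is what the kernel checks:
id -u

# docker run --rm alpine id -u    # prints 0: root inside the container, and
#                                 # without user namespaces, root on the host too
```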
With user mapping (user namespace)
You can enable user namespace remapping (e.g. --userns-remap=alice in the daemon, or /etc/subuid and /etc/subgid). Container UIDs are then mapped into a range on the host (e.g. 100000–165535). Root inside the container is no longer root on the host.
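The remap range is configured in /etc/subuid (and the equivalent /etc/subgid); a sketch for the example above, where alice's containers get host UIDs starting at 100000:

```
# /etc/subuid: user : first host UID : range size
alice:100000:65536
```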
Docker networking
Network isolation for containers comes from the network namespace: the container gets its own view of the network stack. We start with what that stack looks like on Linux, then how Docker uses it—port mapping, container-to-container, and host mode.
The Linux network stack (primer)
A typical Linux host has physical interfaces (e.g. eth0), a loopback device (lo), and the kernel’s networking stack (routing, iptables/nftables, etc.). Docker’s networking builds on these same building blocks.
A packet flows through the stack. Incoming: interface → routing → iptables → app. Outgoing: app → (DNS for hostnames) → routing → iptables → interface.
Network namespace — own interfaces only
The container has its own network stack: loopback and a virtual interface (e.g. veth) paired with the host. It cannot see the host’s physical NICs or other containers’ interfaces.
Only these interfaces exist. No visibility of host eth0 or other containers.
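The network namespace is visible under /proc like the others, and iproute2 can create one by hand (root required, so those lines are commented):

```shell
# This process's network namespace identity:
readlink /proc/self/ns/net

# ip netns add demo            # create a new, empty network namespace
# ip netns exec demo ip link   # inside: only lo, and it starts DOWN
# ip netns del demo
```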
Port mapping (host :8080 → container :8080)
When you publish a port with -p 8080:8080, the host listens on 8080 and uses iptables/nftables DNAT to rewrite the destination to the container’s IP and port. The packet is then forwarded via docker0 and the veth pair into the container. Use the animation below to step through the path.
Packet arrives on host eth0 with destination host:8080.
Networking between containers
When two containers are on the same user-defined network (e.g. docker network create mynet), they each get a veth pair onto the same bridge on the host. The bridge is outside the containers—it lives in the host network namespace. Packets between containers go from one container’s eth0 through its veth to the bridge, then through the other veth to the other container’s eth0; no NAT, just bridge forwarding. On user-defined networks, Docker runs an embedded DNS so containers can reach each other by name (e.g. “ping app” resolves “app” to that container’s IP). The animation below shows the layout (which interface belongs to which container, and that the bridge is on the host), then DNS resolution, then the packet path.
Node server (A) and Postgres (B) each have their own network namespace with lo and eth0. The bridge and veth pairs live on the host.
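The same wiring can be built manually with iproute2. Everything here needs root and the names are illustrative, so the commands are commented; the last line just lists the interfaces in the current namespace, which is where bridges and host-side veth ends would appear:

```shell
# ip link add br-demo type bridge && ip link set br-demo up
# ip link add vethA type veth peer name vethA-c   # a veth pair: two linked ends
# ip link set vethA master br-demo                # host end attached to the bridge
# ip link set vethA-c netns <container-netns>     # other end moved into the container
# (repeat for container B; the bridge then forwards frames between them)

ls /sys/class/net    # interfaces in this namespace, e.g. lo and eth0
```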
Host network mode
With --network host, the container does not get its own network namespace. It uses the host’s network stack directly—no veth pair, no bridge, no NAT. The container sees the host’s eth0 and lo; if it listens on port 8080, it binds to the host’s 8080. Port mapping (-p) is ignored. This mode is useful when you need maximum performance or direct access to host network interfaces (e.g. for certain monitoring or high-throughput workloads), but it removes network isolation.
Both animations run together. Host mode packet (blue) reaches the app in half the time—fewer hops, no iptables, bridge, or veth.
Why Docker? The daemon and the engine
If containers are “just” processes plus namespaces, cgroups, and chroot, why do we need Docker at all? You could in principle create a container by hand: use unshare to create new namespaces, chroot into a rootfs, configure cgroups, and run your process. The reason Docker exists is that it orchestrates all of these primitives and adds a consistent workflow around images, networking, and storage.
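As a sketch of that by-hand approach (root required, and it assumes you have already unpacked a root filesystem, e.g. from an image tarball, into /tmp/rootfs):

```shell
# New PID, mount, UTS, and network namespaces, chrooted into the rootfs:
# unshare --pid --fork --mount --uts --net chroot /tmp/rootfs /bin/sh
# Resource limits would come from writing to /sys/fs/cgroup/... beforehand.

# The unshare tool itself ships with util-linux on standard distributions:
command -v unshare
```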
When you run docker run or docker build, the docker command-line tool does not perform the work itself. It talks to the Docker daemon (often called the Docker engine)—a long-running process (e.g. dockerd) that runs on the host, usually with root privileges. The daemon is responsible for:
- Images and layers — Storing and resolving image manifests and layer blobs, building images from Dockerfiles (running each instruction in a temporary container, committing layers), and managing the content-addressable layer store (e.g. under /var/lib/docker).
- Containers — Creating the container’s filesystem (e.g. overlay merge of image layers plus a writable layer), creating the namespaces (pid, mount, network, user, etc.), applying cgroup limits, setting up the rootfs and mounts, and starting the entrypoint process. On shutdown, it cleans up namespaces, mounts, and the writable layer unless you use volumes or commit.
- Networking — Creating and attaching virtual interfaces (veth pairs), bridges (e.g. docker0), and iptables/nftables rules for port publishing and container-to-container communication. It also runs an embedded DNS for user-defined networks so containers can resolve each other by name.
- Volumes — Creating and mounting bind mounts or named volumes into the container’s mount namespace and managing volume lifecycle.
So the engine is the daemon plus the low-level plumbing (containerd, runc, etc., depending on your Docker version). The CLI is a client; the daemon is the single place that owns image and container state and ties all the Linux primitives together. Without it, you’d be manually creating namespaces, mounting overlays, and wiring cgroups and networks every time. Docker gives you a single API, a portable image format, layer caching, and declarative Dockerfiles so you don’t have to script those steps yourself.
Alternatives
Because containers are just processes plus Linux primitives, the same workflow — images, layers, run, build — can be implemented by different tools.
Podman is a prominent alternative.
It is daemonless: there is no long-running podman process.
When you run podman run or podman build, the Podman CLI (or its subprocesses) sets up namespaces, overlay mounts, and cgroups directly and then exits or hands the container to a short-lived conmon process for monitoring.
That avoids a single root-owned daemon and can make it easier to run containers as an unprivileged user (rootless mode).
Podman is CLI-compatible with Docker for many commands (podman run accepts the same flags as docker run), and it uses the same image format (OCI) and can pull from the same registries.
So existing Docker images and Dockerfiles usually work as-is with Podman.
Summary
Containers are not lightweight VMs. They are processes on the host kernel, isolated by namespaces (pid, mount, user, network), cgroups, chroot, and capabilities. The Docker daemon (engine) is what ties it all together. The CLI talks to the daemon, which owns images, layers, container lifecycle, networking, and volumes. It orchestrates the Linux primitives so you get a single API and portable image format instead of scripting namespaces and mounts by hand. An image is a stack of read-only layers; a running container adds a thin writable layer and uses overlay (or similar) to present a single filesystem. Volumes and port mapping are just bind mounts and NAT rules in the container’s mount and network namespaces. Once you see Docker as a tidy wrapper around these Linux primitives, the “magic” becomes predictable: same kernel, same building blocks, just a different view of the system.