As businesses move more of their infrastructure online under competitive pressure (not to mention COVID-19), finding the best way to manage that infrastructure becomes increasingly important. As we saw in Part 1, Docker enables development teams to build more reliable, repeatable, and testable systems that can be deployed at massive scale with the click of a button.
In this installment, we are going to take a look at the technology behind Docker and how it originated.
From Emulators to Virtual Machines
Docker allows you to run numerous “containers” at the same time on a single computer. Each of these containers acts as if it were a separate computer. It knows nothing about what else is on the computer. For all it knows, it is the only thing on the computer. The host, however, can manage multiple containers at once and lets the user control which containers run on which machines.
How can computers do that? One thing that has long been evident about computers is that one computer can run “inside” another. Surprisingly, this was known before computers as we know them today were even invented. The notion appeared, in inchoate form, in Alan Turing’s original paper describing computers, “On Computable Numbers” (1936). There he noted that Universal Turing Machines (what we would call “general-purpose computers”) can simulate any other Universal Turing Machine. In other words, you can get a computer to act as if it were another computer.
There are many ways that this virtualization can occur. The most general way is via a program that reads each instruction the virtualized processor would have executed and then performs the equivalent steps itself. This intermediate program is known as an emulator because it emulates the way that a processor would act. The problem with this technique is that it is really slow. Originally, however, it was the only way to achieve virtualization.
Processors were later developed whose mechanisms allowed virtualized processes to run directly on the processor for most tasks but to take a more roundabout path when executing operating system tasks. That is, you could tell the processor that certain parts of the code can run directly while other parts require more care. This development enabled systems such as VMware to work so well. For the most part, the virtualized systems run as ordinary processes on your computer, making use of the processor directly for most tasks.
These virtualization technologies allowed for the growth of “cloud computing,” which allowed systems administrators to purchase large computers and “provision” them into smaller ones. Costs and overhead of systems management were greatly reduced and an entire ecosystem of servers could be provisioned on demand. For more information about developing applications in this sort of environment, see my book Building Scalable PHP Web Applications Using the Cloud.
Increasing Isolation Inside the Operating System
Progress was also made on the operating system side. Operating systems have long had, as a primary job, keeping different processes from interfering with each other. The first big leap in making this possible was “virtual memory” (VM), which isolates each process’s access to your computer’s memory. When VM is used, one process literally cannot access another process’s memory.
Another interesting development in operating systems is known to systems administrators as “chroot jails.” This comes from the name of a command, chroot, which changes which parts of the filesystem a process is allowed to see. The command has long been popular with web server administrators. Oftentimes, administrators will increase the security of their systems by “chrooting” the web service to a directory with a small set of files. Then, if a hacker manages to find a security flaw in the web service and break in, access would be limited to the files available in the chroot jail.
Through the years, operating systems have added more and more features that allow processes to isolate themselves from each other. Processes can be blocked from even seeing that other processes exist and can even be isolated onto a separate network from the rest of the computer!
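On Linux, these isolation features are called “namespaces,” and you can see the ones applied to any process under `/proc`. This sketch assumes a Linux system; the `unshare` invocation that actually creates a new namespace typically needs root, so it is commented out:

```shell
# Each process's isolation boundaries ("namespaces") are visible under /proc.
# A plain process shares all of these with the rest of the system:
ls /proc/self/ns        # lists entries such as: ipc mnt net pid user uts

# Launching a process in its own network namespace normally requires root,
# so it is shown commented out. Inside, only a bare loopback interface exists:
# sudo unshare --net ip link show
```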
The advantage of process isolation over virtualization within a computer is that, in virtualization, every virtual machine requires the overhead of running, essentially, a complete operating system. However, an isolated process gets the benefit of sharing a lot more of the operating environment with the other programs. Thus, an isolated process is much more lightweight than a full virtual machine.
However, isolated processes are much harder to manage. A virtual machine doesn’t care that it is a virtual machine, and you can treat it exactly as if it were an ordinary bare-metal machine. A chrooted process, however, usually needs a bit of work to get it to run correctly.
The Birth of Containers
Eventually, programmers figured out how to manage isolated processes in much the same way that virtual machines are managed. This technique of isolating processes so that they behave like virtual machines is known as a “container.” The first system that implemented containers similar to those of today was LXC (Linux Containers). LXC pulled together all of the pieces and created a “lightweight VM”: a virtual machine that shares the core operating system among the different containers. Because of the advances in process isolation, these containers could even be placed on a virtual network, giving each container its own location on the network.
The Union Filesystem
The early container systems had a problem, however. Each container still used a lot of disk space. While the containers shared the core of the operating system (known as the kernel), each one ended up needing its own full copy of the other essential parts of the operating system. It needed all the support files copied into each container. This is a lot of essentially wasted overhead. The problem was solved by the union filesystem.
The filesystem is the format applied to a storage device. If you are on Windows, the filesystem is known as NTFS. If you are on a Mac, your drive is formatted with the Apple File System (APFS). For most filesystems, every file in a directory (and, usually, every file on the same drive) belongs to the same filesystem.
The union filesystem works by stacking a number of read-only layers, overlaid on each other, with a single writable layer on top. You can still appear to modify (or even delete) any of the files in the read-only layers. Doing so, however, does not change the read-only layers; instead, the change is recorded as a new entry in the topmost layer, which is read-write. This allows you to share large portions of a filesystem, even in cases where the shared parts may need to be written to.
This modification is useful for containers because the read-only parts of a filesystem can be shared since they can’t be modified. Therefore, the operating system files, which don’t change from container to container, only have to exist once. They get put into a shared read-only filesystem, which is then used as a base layer for each container. Then, each container gets a read-write layer stuck on top of the base layer. Thus, to each container, it “feels” normal; each container has full authority to write to any part of its filesystem. However, because there is a large part that is shared and never ordinarily modified, you save a ton of disk space and can, essentially, have hundreds of full copies of an operating system running with practically zero increased overhead.
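The lookup rule behind this can be simulated with two plain directories. Here, `lower` stands in for the shared, read-only base layer and `upper` for a container’s private read-write layer; the directory and file names are hypothetical, and a real overlay mount (`mount -t overlay …`) would require root:

```shell
# The shared base layer holds the "operating system" file:
mkdir -p lower upper
echo "shared OS file" > lower/os-release

# Reading: the union checks the top layer first, then falls through to the base:
cat upper/os-release 2>/dev/null || cat lower/os-release

# "Writing" lands in the top layer only; the base layer is untouched:
echo "container's own edit" > upper/os-release
cat upper/os-release    # the container sees its own edit
cat lower/os-release    # the shared base still says "shared OS file"
```

Every container gets its own `upper` directory, while a single `lower` serves them all, which is where the disk savings come from.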
The Rise of Docker
These two components, containers and the union filesystem, became the core of the system known as Docker, which has now become almost synonymous with containers. Over and above the basic container technology, Docker also provides a well-defined system of container management.
Docker containers are defined by the images from which they are made. The images contain a series of filesystem layers that are packaged together. A “container” is a combination of an image that is running with a top-level read-write filesystem. Docker provides a registry for housing these images, and also a standard protocol for retrieving them so that other registry providers can integrate seamlessly. Additionally, these layers can be shared between multiple different images in order to save even more space.
To get a feel for how Docker containers can improve resource usage, imagine that you had three applications that you wanted to run in a virtualized environment, two of them running in a Ruby environment and one of them running in a Python environment. To run these on virtual machines, you would need a complete copy of the operating system, the environment, and the application for each virtual machine that you ran. With Docker, however, you can have a base operating system layer that is shared by all of your images. So the operating system need only be stored once. Then, the environment can be stored in another layer. This means that the layer containing the Ruby environment can be shared between the two applications. Finally, the applications themselves can sit on a layer at the top. This provides the isolation level of a virtual machine but the overhead is as minimal as running the applications on the same machine.
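As a sketch of how that sharing falls out in practice, both Ruby applications could start from the same base image, so Docker stores the operating system and Ruby layers only once. The application names and the `ruby:3.2-slim` tag here are illustrative, not prescriptive:

```dockerfile
# app1/Dockerfile and app2/Dockerfile (two hypothetical Ruby apps)
# can begin with the same line, so these layers are stored only once:
FROM ruby:3.2-slim          # shared layers: base OS + Ruby environment
COPY . /app                 # per-application layer on top
CMD ["ruby", "/app/main.rb"]
```

The Python application would simply use a Python base image instead, sharing whatever operating system layers that image has in common with the others.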
In the next part of this series, we will cover the basic commands for running Docker.