If you've had the experience of trying to get programs and configuration files installed the same way on multiple computers, then you can appreciate the appeal of Docker: what Conda and Pip provide to Python in environment reproducibility, Docker achieves for anything you can run on Linux. This is a big deal in the software engineering world, because installation wrangling can suck up a lot of developer time. It's also valuable to scientists and data scientists, because research can be reproduced with rock solid reliability with the execution of a single OS-independent command. These benefits have driven the steady rise in popularity that Docker has enjoyed since it was introduced in 2013.
To oversimplify a bit, using Docker involves figuring out how to build your desired computational environment by running a sequence of shell commands starting from a bare-bones Linux operating system. You store this sequence of commands in a text file called a Dockerfile, and the result of performing the specified installation steps is stored as a Docker image. Docker (the company) provides a free cross-platform application called Docker Desktop, which allows any user to download published Docker images from a repository and run them on their own machine. Running instances of an image, called containers, each have their own isolated filesystem on the user's computer.
Since no assumptions are made about anything else on the user's system, Docker applications reliably run the same way for everyone. Furthermore, Docker containers are not virtual machines, so they are lightweight and can run code with near-native performance. They are especially useful in cloud computing contexts, because you can debug a system on your computer and deploy it to the cloud without worrying about how to configure everything to work in the cloud the same way it works on your personal machine.
To give you a sense of how profound this can be, if you install Docker Desktop and run
docker run -p 8888:8888 jupyter/datascience-notebook
then several Docker images that are a part of the
datascience-notebook stack published by the Jupyter team will be downloaded to your machine. The download takes a while, but when it's complete, you will have a running Jupyter instance accessible at
localhost:8888 in your browser (the
-p 8888:8888 part of the command connects the port 8888 in the container to the port 8888 in the host operating system). This notebook will have Python, R, and Julia kernels, each complete with curated sets of data science packages. It would take much more work to follow a list of installation instructions to achieve the same setup using native installations. Furthermore, the time cost of downloading images is incurred only the first time you run the command, because downloaded images are saved on your computer for fast loading in the future.
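As a small sketch of the port mapping, the function below starts the same image with a host port of your choosing while the container side stays at 8888. The function name run_notebook and the DOCKER variable are conveniences invented for this example (not part of Docker): leaving DOCKER unset runs the real CLI, while setting DOCKER=echo only prints the command.

```shell
# DOCKER defaults to the real CLI; DOCKER=echo prints the command instead
# of running it (an assumption made for this sketch, handy for dry runs).
DOCKER="${DOCKER:-docker}"

# Start the notebook image with a chosen host port (default 8888).
# The container side stays 8888, the port used in the command above.
run_notebook() {
  host_port="${1:-8888}"
  "$DOCKER" run -p "${host_port}:8888" jupyter/datascience-notebook
}
```

Calling run_notebook 9999 should then make the notebook reachable at localhost:9999 on the host, which can help when port 8888 is already in use.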
Disadvantages of using Docker include: (1) running a given piece of software both through Docker and natively on your operating system requires having two installations of it, and (2) care must be taken to connect the container to your operating system so you can interact with it (for example, using Jupyter notebooks, or saving files from within the container and having them show up in your primary file system).
To see how Docker works and how we might use it in practice, let's take a closer look at the Jupyter data-science notebook. When we run
docker run -p 8888:8888 jupyter/datascience-notebook from the command line, we're telling Docker that we want a container running the
jupyter/datascience-notebook image. Docker Desktop is able to find that image because it's registered on Docker Hub. If we take a look at the Dockerfile used to build that image, we see a sequence of Dockerfile commands beginning with all-caps instructions. The most important ones are:
- FROM. Specifies an image to build on top of. This can be an image from Docker Hub or one you've built locally from another Dockerfile.
- RUN. Executes shell commands. Useful for downloading files from the internet and performing other installation steps.
- COPY. Copies files from the directory containing the Dockerfile into the image. Useful for configuration files or shell scripts (so you don't have to put all of the instructions into the Dockerfile).
- CMD. Specifies a default command to run when executing a container. The most common default is bash (so running a container drops you into a shell session), but the Jupyter notebook images launch Jupyter Lab so you can connect to the container using your browser.
- EXPOSE. Declares a container port for the host operating system to connect to (the connection itself is made with the -p option of docker run). For Jupyter, it's customary to use port 8888.
- USER. Some installation steps require enhanced filesystem permissions; the Dockerfile solution is to switch to the root user with the line USER root.
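To see these instructions side by side, here is a minimal sketch that writes a small demonstration Dockerfile into a directory of your choice. It is not the Dockerfile the Jupyter team publishes; the function name write_demo_dockerfile, the script setup.sh, and the directory /opt/demo are hypothetical names chosen only for illustration.

```shell
# Write a small demonstration Dockerfile into the directory given as $1.
# setup.sh and /opt/demo are hypothetical names used only for illustration.
write_demo_dockerfile() {
  cat > "$1/Dockerfile" <<'EOF'
# Build on top of the published Jupyter image
FROM jupyter/datascience-notebook

# Creating a directory under /opt needs root, so switch users around that step
USER root
RUN mkdir -p /opt/demo
USER jovyan

# Copy a (hypothetical) script from the build directory into the image
COPY setup.sh /home/jovyan/setup.sh

# Declare the port that Jupyter listens on
EXPOSE 8888

# Default command when a container starts
CMD ["/bin/bash"]
EOF
}
```

Building this file would additionally require a setup.sh next to it, since COPY fails when the source file is missing.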
Let's use some of these commands to make our own Docker image for a project with the following directory structure:
.
├── README.md          ← Explanation of the project and instructions on how to use
├── Dockerfile         ← Script to build the Docker image
├── Makefile           ← Encode project dependency structure for reproducibility
├── data
│   ├── raw            ← stores original data (untouched)
│   └── processed      ← stores files computed from original data
├── models             ← stores Python objects for trained models
├── reports            ← final writeup
│   ├── figures
│   └── report.tex
└── src                ← source code for processing data and models
    ├── features       ← data processing
    │   └── build_features.py
    ├── models         ← model training and prediction
    │   ├── predict_model.py
    │   └── train_model.py
    └── visualization  ← generate figures
        └── visualize.R
You can get these files by cloning a Git repo prepared for this purpose:
git clone git@github.com:data-gymnasia/data-science-docker.git
In our Dockerfile we begin with the following contents:
FROM jupyter/datascience-notebook

# set working directory to home directory
WORKDIR /home/jovyan

# copy whole current directory into the image
COPY . project

# Get data from GitHub
RUN cd project/data/raw && \
    wget https://browndsi.github.io/data/iris.csv

# Enter bash session in the project directory when
# the container is run
WORKDIR project
CMD /bin/bash
We build on the Jupyter datascience-notebook image, copy our local files into the image, acquire the data from the internet, and start the container in a bash session. Then we build the Docker image by running (from the top level of the directory)
docker build -t myproject .
The -t myproject part tags the image with the name myproject, and the dot means "the current directory" in Unix.
Unfortunately, this image won't build, because of permissions issues. Looking at Jupyter's Dockerfiles, we find some inspiration: a script called
fix-permissions. This script can only be run as the
root user, so we amend our Dockerfile to get this:
FROM jupyter/datascience-notebook

# set working directory to home directory
WORKDIR /home/jovyan

# copy whole current directory into the image
COPY . project

# Get data from GitHub
USER root
RUN fix-permissions project && \
    cd project/data/raw && \
    wget https://browndsi.github.io/data/iris.csv
USER jovyan

# Enter bash session in the project directory when
# the container is run
WORKDIR project
CMD /bin/bash
Then when we run
docker build -t myproject ., we get a successfully built image. We can see a list of our images by running
docker images at the command line, and we can run the image we just made with
docker run -i -t myproject
The -i and -t flags are for 'interactive' and 'terminal', indicating that we want to begin a shell session when we run the container.
After running this command, we have a command prompt inside our running container. We can do
cat Makefile to see how the Makefile encodes dependencies among the project components, as well as providing instructions for processing. Its contents are:
.PHONY: features train predict figures reports all

all: reports

features: src/features/build_features.py
	python src/features/build_features.py data/raw/ data/processed/

train: features src/models/train_model.py
	python src/models/train_model.py data/processed/ models/trained_model.joblib

predict: train src/models/predict_model.py
	python src/models/predict_model.py data/processed/ models/trained_model.joblib reports/

figures: src/visualization/visualize.R
	Rscript src/visualization/visualize.R data/processed/ reports/figures/

reports: reports/report.tex predict figures
	cd reports && \
	pdflatex report.tex && \
	pdflatex report.tex
We can visualize the dependency structure described by this Makefile as a directed graph:
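The edges of that graph can also be written out as text. As a rough sketch, the function below lists the target-to-target prerequisites from the Makefile (source-file prerequisites are left out) and passes them to the standard tsort utility, which prints one order in which the targets could be built. The function name build_order is made up for this example, and it assumes tsort is installed, as it is on most Unix-like systems.

```shell
# Each line "A B" says target A must be built before target B,
# mirroring the target-to-target prerequisites in the Makefile
# (prerequisites that are source files are omitted).
build_order() {
  tsort <<'EOF'
features train
train predict
predict reports
figures reports
reports all
EOF
}
```

Several orders are valid (figures can come anywhere before reports), so the exact output may differ between systems, but features always precedes train, train precedes predict, and reports comes after both predict and figures.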
We can build the whole project from the Docker container with make all. However, when we do that we realize that the joblib package (which is being used by some of the Python files) isn't available in the Jupyter datascience-notebook Docker image. Therefore, we need to put that installation step into our Dockerfile and rebuild. We add the lines
# Install joblib for storing Python models. The
# '--yes' option preempts "proceed?" questions
RUN conda install joblib --yes
Building and running again, we can do
make all from inside the running container to produce a PDF in the
reports directory. We won't be able to view that file directly since it's inside the container. We'll need to copy it from the container to our operating system so that our OS's PDF viewing app can read it.
The command for transferring files out of containers is
docker cp. We'll need to know the name of the container, which we can get using
docker ps (note that this has to be run from your OS, so you should open a separate Terminal tab). In the last column of the
docker ps output, we see a random name like
great_mayer. Then you can copy the file to
~/Desktop (for example) using
docker cp great_mayer:/home/jovyan/project/reports/report.pdf ~/Desktop
We could have given our project a name with the
--name option when we did
docker run, and that would have allowed us to skip the
docker ps step.
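A sketch of that alternative follows: one helper starts the container under a fixed name and another copies the report out using the same name. The function names run_named and copy_report and the DOCKER variable are invented for this example; setting DOCKER=echo prints the commands rather than running them.

```shell
# DOCKER defaults to the real CLI; DOCKER=echo prints the command instead
# of running it (an assumption made for this sketch).
DOCKER="${DOCKER:-docker}"

# Start the container under a fixed name ($1) instead of a random one.
run_named() {
  "$DOCKER" run -i -t --name "$1" myproject
}

# Copy the report out of the named container ($1) into a host directory ($2).
copy_report() {
  "$DOCKER" cp "$1:/home/jovyan/project/reports/report.pdf" "$2"
}
```

For instance, you might call run_named report-build in one terminal and, after make all finishes, copy_report report-build ~/Desktop in another. The name report-build is arbitrary.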
The docker cp utility can be inadequate for extensive file transferring between the container and host OS. Docker supports a more robust approach using volumes, which are directories shared between the container and host. You can read more about volumes in the Docker documentation.
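As a minimal sketch of the shared-directory idea, the function below uses the -v option of docker run to mount a host directory inside the container. The mount point /home/jovyan/shared, the function name run_with_shared_dir, and the DOCKER variable are assumptions made for this example. The host path should be absolute, since Docker treats a bare name in that position as a named volume rather than a host directory.

```shell
# DOCKER defaults to the real CLI; DOCKER=echo prints the command instead
# of running it (an assumption made for this sketch).
DOCKER="${DOCKER:-docker}"

# Run the image with host directory $1 (an absolute path) mounted at
# /home/jovyan/shared, a hypothetical mount point inside the container.
run_with_shared_dir() {
  "$DOCKER" run -i -t -v "$1:/home/jovyan/shared" myproject
}
```

With the container started this way, a file copied into /home/jovyan/shared from inside the container should appear in the chosen host directory without a separate docker cp step.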
- To get files into a Docker image during its build, we use COPY.
- To see a list of all of the Docker images we have on our machine, we use docker images.
- To see a list of running containers, we do docker ps.
- Jupyter uses the Dockerfile command EXPOSE to connect the Jupyter server on the container to the browser on the host OS.
- The Dockerfile command FROM is used to build an image on top of an existing image.
- The Dockerfile command CMD can be used to specify what executable should run when the container is started.
Congratulations! You have completed the Data Gymnasia course on Data Science Utilities.