One of the most time-consuming parts of booting a Docker development environment is initializing databases. The data container pattern gets around this obstacle by taking advantage of some lesser-known features of volumes. With data containers, you can easily distribute, maintain, and load your database's seed data.
Data containers are a commonly overlooked tool for building the nirvana of development environments: booting the environment with a single command that works every time.
You're probably already using volumes to save you some time working with databases during development; if one of your development containers crashes, volumes will prevent you from losing the database's state. But interestingly, Docker volumes have some cool quirks that we can leverage for the data containers pattern.
In this post, I'll:
- Explain why data containers are the best way to initialize your databases.
- Explain how data containers work by taking advantage of some unusual behavior that isn't present in other volume implementations, such as Kubernetes volumes.
- Walk you through a brief example of how to do it.
Just want the code?
Get it here and boot it with docker-compose up (or blimp up!).
Standard Techniques for Initializing Databases
When developing with Docker, there are three approaches developers commonly use to set up their databases. All of them have serious drawbacks.
1) Initialize Your Database By Hand
Most people start by setting up their databases by hand. But this has several serious drawbacks:
- This approach can be very time-consuming. For example, it's easy to spend an entire day copying the data you need from staging and figuring out how to seed your database with it. And, if you lose your volume, you have to do it all over again.
- It's hard to sustain over time. I once worked with a team that dreaded destroying their database because they knew they'd have to re-initialize it later. As a result, they would avoid working on certain projects just to spare themselves the pain of destroying and re-initializing their database.
2) Initialize Your Database Using a Script
Using a script can save you a lot of manual work. But it comes with its own set of headaches:
- The script may take a while to run, slowing down the environment boot time.
- In the rush of all the other work developers have to do, it's easy to put off maintaining the script. As the database's schema changes over time, the script breaks, and then you have to spend time debugging it.
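For context, with the official postgres image "a script" is often nothing more than seed SQL mounted into the image's init hook, which runs the first time an empty data directory is initialized. A minimal sketch (seed.sql is a stand-in for your own dump):

```
# The official postgres image runs any *.sql or *.sh files found in
# /docker-entrypoint-initdb.d when it initializes an empty data directory.
docker run -d \
  --name dev-postgres \
  -e POSTGRES_PASSWORD=dev \
  -v "$(pwd)/seed.sql:/docker-entrypoint-initdb.d/seed.sql:ro" \
  postgres:13
```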
3) Use a Remote Database
Using a remote database -- typically your staging database -- is certainly faster than running scripts or initializing your database by hand. But there's a big downside: you're sharing the database with other developers. That means you don't have a stable development environment. All it takes is one developer mucking up the data to ruin your day.
A Better Way: Data Containers
Data containers are containers that store your database's state, and are deployed like any other container in your Docker Compose file. They take advantage of some quirks of Docker volumes to copy that state from the data container into the database's volume, so the database is fully initialized when it starts.
To see how volumes can speed up your development work with databases, let's take an example from the Magda data catalog system. Here's a snippet from the Magda Docker Compose file:
services:
  postgres:
    image: "gcr.io/magda-221800/magda-postgres:0.0.50-2"
    volumes:
      - 'db-data:/data'
    environment:
      - "PGDATA=/data"
  postgres-data:
    image: "gcr.io/magda-221800/magda-postgres-data:0.0.50-2"
    entrypoint: "tail -f /dev/null"
    volumes:
      - 'db-data:/data'
volumes:
  db-data:
When you run docker-compose up in the Magda repo, all the Magda services start, and the Postgres database is automatically initialized.
How It Works
This setup takes advantage of two features of Docker volumes:
1) Docker copies any files masked by a volume into the volume -- that is, when a container mounts an empty named volume over a directory that already contains files in its image, Docker populates the volume with those files. The Magda example has the following in its Docker Compose file:
postgres-data:
  image: "gcr.io/magda-221800/magda-postgres-data:0.0.50-2"
  entrypoint: "tail -f /dev/null"
  volumes:
    - 'db-data:/data'
When postgres-data starts, it mounts a volume to /data. Because we built the gcr.io/magda-221800/magda-postgres-data image to already have database files at /data, Docker copies those files into the volume.
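How you build such a data image is up to you; the general recipe is just "bake a ready-made data directory into an image at /data." Here's a minimal sketch, not Magda's actual build -- seed-pgdata/ (a pre-initialized Postgres data directory) and the image name are placeholders:

```
# Hypothetical data-image build. seed-pgdata/ is assumed to be a
# pre-initialized, sanitized Postgres data directory sitting next to this
# script (the CI sketch later in the post shows one way to produce it).
cat > Dockerfile.data <<'EOF'
FROM alpine:3.12
# Bake the seed data into the image at the path the db-data volume will mask.
COPY seed-pgdata/ /data/
# The container only exists to hold files; keep it alive so Compose treats
# it like any other long-running service.
CMD ["tail", "-f", "/dev/null"]
EOF

docker build -f Dockerfile.data -t my-registry/postgres-data:latest .
```

One wrinkle: Postgres is picky about the ownership and permissions of its data directory. The official postgres image's entrypoint fixes both when it runs as root; if your database container runs as a non-root user, you'll need to set them inside the data image instead.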
2) Volumes can be shared between containers. So any files written to db-data by postgres-data are visible in the postgres container, because the postgres container also mounts the db-data volume:
postgres:
  image: "gcr.io/magda-221800/magda-postgres:0.0.50-2"
  environment:
    - "PGDATA=/data"
  volumes:
    - 'db-data:/data'
Putting this all together, when you run docker-compose up, the following happens:
- Docker copies /data from postgres-data into the db-data volume.
- Docker starts the postgres container.
- The postgres container boots with the data already in its data directory.
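You can watch both behaviors for yourself with stock images -- nothing here is Magda-specific, and demo-seed is just a throwaway volume name:

```
# 1) Copy-on-first-mount: mounting an empty named volume over a directory
#    that already has files in the image copies those files into the volume.
docker volume create demo-seed
docker run --rm -v demo-seed:/etc/nginx nginx:1.25-alpine true

# 2) Shared volumes: any other container mounting demo-seed sees the files
#    the first container's image seeded it with.
docker run --rm -v demo-seed:/seed alpine:3.12 ls /seed

docker volume rm demo-seed
```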
In short, instead of repeatedly initializing your databases by hand or creating and maintaining scripts, you get a fully automated setup that works every time -- with remarkably little work on your part.
Benefits
This approach has three major benefits:
- It's a huge timesaver for developers -- booting is now super quick, and developers don't have to manually add data or create and maintain scripts.
- It ensures that everyone on your team is working with an identical set of data. To get the newest version of the data, all you need to do is docker pull the data image, just like any other image. In fact, docker-compose will do that for you, so you don't even have to think about it when onboarding.
- It's easy to automate. There's a lot of existing tooling for automating Docker builds, which you can take advantage of with this approach.
Downsides
The main downside of this approach is that it can be hard to maintain the data container. Maintaining it by hand has the same downsides as initializing the database manually or with scripts -- the data can get stale as the db schema changes.
Teams that use this approach tend to generate their data containers using CI. The CI job snapshots and sanitizes the data from production or staging, and pushes it to the Docker registry. This way, the container generation is fully automated, and developers don't have to worry about it.
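The exact job depends on your stack, but the shape is usually the same: dump, sanitize, bake into an image, push. Here's a hedged sketch for Postgres -- every host name, credential, database name, and file name below is a placeholder, and a real job would poll for readiness instead of sleeping:

```
#!/bin/bash
set -euo pipefail

# 1) Snapshot the staging database.
pg_dump --host staging-db --username app --dbname app > seed.sql

# 2) Sanitize the dump (scrub emails, tokens, etc.). sanitize.sed stands in
#    for whatever scrubbing your data needs.
sed -f sanitize.sed seed.sql > seed-clean.sql

# 3) Replay the dump into a throwaway Postgres whose data directory is not
#    a volume (PGDATA=/data), so the files stay in the container filesystem.
docker run -d --name seed-db -e POSTGRES_PASSWORD=dev -e PGDATA=/data postgres:13
sleep 15   # crude wait for initialization
docker exec seed-db createdb -U postgres app
docker exec -i seed-db psql -U postgres -d app < seed-clean.sql
docker stop seed-db   # clean shutdown so the data directory is consistent
docker cp seed-db:/data ./seed-pgdata
docker rm seed-db

# 4) Bake the data directory into a data image (Dockerfile.data is the
#    sketch from earlier) and push it for the team to pull.
docker build -f Dockerfile.data -t my-registry/postgres-data:latest .
docker push my-registry/postgres-data:latest
```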
Conclusion
Data containers are a cool example of how Docker Compose does so much more than just boot up containers. Used properly, it can substantially increase developer productivity.
We're excited to share these developer productivity tips because we've noticed that the development workflow has become an afterthought during the move to containers. The complexity of modern applications requires new development workflows. We built Blimp so that development teams can quickly build and test containerized software, without having to reinvent a development environment approach.
Resources
Check out another trick for increasing developer productivity by using host volumes to get rid of container rebuilds.
Try an example with Blimp to see how easily development on Docker Compose can be scaled into the cloud.
Read common Docker Compose mistakes for more tips on how to make development faster.