One of the most time-consuming parts of booting a Docker development environment is initializing databases. The data container pattern gets around this obstacle by taking advantage of some lesser-known features of volumes. With data containers, you can easily distribute, maintain, and load your database's seed data.
Data containers are a commonly overlooked tool for building the nirvana of development environments: booting the environment with a single command that works every time.
You're probably already using volumes to save you some time working with databases during development; if one of your development containers crashes, volumes will prevent you from losing the database's state. But interestingly, Docker volumes have some cool quirks that we can leverage for the data containers pattern.
In this post, I'll:
- Explain why data containers are the best way to initialize your databases.
- Explain how data containers work by taking advantage of some unusual behavior that isn't present in other volume implementations, such as Kubernetes volumes.
- Walk you through a brief example of how to do it.
Just want the code?
Get it here and boot it with docker-compose up (or blimp up!).
Standard Techniques for Initializing Databases
When developing with Docker, there are three approaches developers commonly use to set up their databases. All of them have serious drawbacks.
1) Initialize Your Database By Hand
Most people start by setting up their databases by hand. But this has several serious drawbacks:
- This approach can be very time-consuming. For example, it's easy to spend an entire day copying the data you need from staging and figuring out how to seed your database with it. And, if you lose your volume, you have to do it all over again.
- It's hard to sustain over time. I once worked with a team that dreaded destroying their database because they knew they'd have to re-initialize it later. As a result, they would avoid working on certain projects just to spare themselves the pain of destroying and re-initializing their database.
2) Initialize Your Database Using a Script
Using a script can save you a lot of manual work. But it comes with its own set of headaches:
- The script may take a while to run, slowing down the environment boot time.
- In the rush of all the other work developers have to do, it's easy to put off maintaining the script. As the database's schema changes over time, the script breaks, and then you have to spend time debugging it.
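For context, with the official postgres image "a script" is often nothing more than seed SQL mounted into the image's init hook, which runs the first time an empty data directory is initialized. A minimal sketch (seed.sql is a stand-in for your own dump):

```
# The official postgres image runs any *.sql or *.sh files found in
# /docker-entrypoint-initdb.d when it initializes an empty data directory.
docker run -d \
  --name dev-postgres \
  -e POSTGRES_PASSWORD=dev \
  -v "$(pwd)/seed.sql:/docker-entrypoint-initdb.d/seed.sql:ro" \
  postgres:13
```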
3) Use a Remote Database
Using a remote database -- typically your staging database -- is certainly faster than running scripts or initializing your database by hand. But there's a big downside: you're sharing the database with other developers. That means you don't have a stable development environment. All it takes is one developer mucking up the data to ruin your day.
A Better Way: Data Containers
Data containers are containers that store your database's state, and are deployed like any other container in your Docker Compose file. They take advantage of some quirks of Docker volumes to copy that state from the data container into the database's volume, so the database is fully initialized when it starts.
To see how volumes can speed up your development work with databases, let's take an example from the Magda data catalog system. Here's a snippet from the Magda Docker Compose file:
services:
  postgres:
    image: "gcr.io/magda-221800/magda-postgres:0.0.50-2"
    volumes:
      - 'db-data:/data'
    environment:
      - "PGDATA=/data"
  postgres-data:
    image: "gcr.io/magda-221800/magda-postgres-data:0.0.50-2"
    entrypoint: "tail -f /dev/null"
    volumes:
      - 'db-data:/data'
volumes:
  db-data:
When you run docker-compose up in the Magda repo, all the Magda services start, and the Postgres database is automatically initialized.
How It Works
This setup takes advantage of two features of Docker volumes:
1) Docker copies any files masked by a volume into the volume -- that is, when a container mounts an empty named volume over a directory that already contains files in its image, Docker populates the volume with those files. The Magda example has the following in its Docker Compose file:
postgres-data:
  image: "gcr.io/magda-221800/magda-postgres-data:0.0.50-2"
  entrypoint: "tail -f /dev/null"
  volumes:
    - 'db-data:/data'
When postgres-data starts, it mounts a volume to /data. Because we built the gcr.io/magda-221800/magda-postgres-data image to already have database files at /data, Docker copies those files into the volume.
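How you build such a data image is up to you; the general recipe is just "bake a ready-made data directory into an image at /data." Here's a minimal sketch, not Magda's actual build -- seed-pgdata/ (a pre-initialized Postgres data directory) and the image name are placeholders:

```
# Hypothetical data-image build. seed-pgdata/ is assumed to be a
# pre-initialized, sanitized Postgres data directory sitting next to this
# script (the CI sketch later in the post shows one way to produce it).
cat > Dockerfile.data <<'EOF'
FROM alpine:3.12
# Bake the seed data into the image at the path the db-data volume will mask.
COPY seed-pgdata/ /data/
# The container only exists to hold files; keep it alive so Compose treats
# it like any other long-running service.
CMD ["tail", "-f", "/dev/null"]
EOF

docker build -f Dockerfile.data -t my-registry/postgres-data:latest .
```

One wrinkle: Postgres is picky about the ownership and permissions of its data directory. The official postgres image's entrypoint fixes both when it runs as root; if your database container runs as a non-root user, you'll need to set them inside the data image instead.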
2) Volumes can be shared between containers. So any files written to db-data by postgres-data are visible in the postgres container, because the postgres container also mounts the db-data volume:
postgres:
  image: "gcr.io/magda-221800/magda-postgres:0.0.50-2"
  environment:
    - "PGDATA=/data"
  volumes:
    - 'db-data:/data'
Putting this all together, when you run docker-compose up, the following happens:
- Docker copies /data from postgres-data into the db-data volume.
- Docker starts the postgres container.
- The postgres container boots with the data already in its data directory.
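You can watch both behaviors for yourself with stock images -- nothing here is Magda-specific, and demo-seed is just a throwaway volume name:

```
# 1) Copy-on-first-mount: mounting an empty named volume over a directory
#    that already has files in the image copies those files into the volume.
docker volume create demo-seed
docker run --rm -v demo-seed:/etc/nginx nginx:1.25-alpine true

# 2) Shared volumes: any other container mounting demo-seed sees the files
#    the first container's image seeded it with.
docker run --rm -v demo-seed:/seed alpine:3.12 ls /seed

docker volume rm demo-seed
```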
In short, instead of repeatedly initializing your databases by hand or creating and maintaining scripts, you get a fully automated setup that works every time -- with remarkably little work on your part.
Benefits
This approach has three major benefits:
- It's a huge timesaver for developers -- booting is now super quick, and developers don't have to manually add data or create and maintain scripts.
- It ensures that everyone on your team is working with an identical set of data. To get the newest version of the data, all you need to do is docker pull the data image, just like any other image. In fact, docker-compose will do that for you, so you don't even have to think about it when onboarding.
- It's easy to automate. There's a lot of existing tooling for automating Docker builds, which you can take advantage of with this approach.
Downsides
The main downside of this approach is that it can be hard to maintain the data container. Maintaining it by hand has the same downsides as initializing the database manually or with scripts -- the data can get stale as the db schema changes.
Teams that use this approach tend to generate their data containers using CI. The CI job snapshots and sanitizes the data from production or staging, and pushes it to the Docker registry. This way, the container generation is fully automated, and developers don't have to worry about it.
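The exact job depends on your stack, but the shape is usually the same: dump, sanitize, bake into an image, push. Here's a hedged sketch for Postgres -- every host name, credential, database name, and file name below is a placeholder, and a real job would poll for readiness instead of sleeping:

```
#!/bin/bash
set -euo pipefail

# 1) Snapshot the staging database.
pg_dump --host staging-db --username app --dbname app > seed.sql

# 2) Sanitize the dump (scrub emails, tokens, etc.). sanitize.sed stands in
#    for whatever scrubbing your data needs.
sed -f sanitize.sed seed.sql > seed-clean.sql

# 3) Replay the dump into a throwaway Postgres whose data directory is not
#    a volume (PGDATA=/data), so the files stay in the container filesystem.
docker run -d --name seed-db -e POSTGRES_PASSWORD=dev -e PGDATA=/data postgres:13
sleep 15   # crude wait for initialization
docker exec seed-db createdb -U postgres app
docker exec -i seed-db psql -U postgres -d app < seed-clean.sql
docker stop seed-db   # clean shutdown so the data directory is consistent
docker cp seed-db:/data ./seed-pgdata
docker rm seed-db

# 4) Bake the data directory into a data image (Dockerfile.data is the
#    sketch from earlier) and push it for the team to pull.
docker build -f Dockerfile.data -t my-registry/postgres-data:latest .
docker push my-registry/postgres-data:latest
```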
Conclusion
Data containers are a cool example of how Docker Compose does so much more than just boot up containers. Used properly, it can substantially increase developer productivity.
We're excited to share these developer productivity tips because we've noticed that the development workflow has become an afterthought during the move to containers. The complexity of modern applications requires new development workflows. We built Blimp so that development teams can quickly build and test containerized software, without having to reinvent a development environment approach.
Resources
Check out another trick for increasing developer productivity by using host volumes to get rid of container rebuilds.
Try an example with Blimp to see how easily development on Docker Compose can be scaled into the cloud.
Read common Docker Compose mistakes for more tips on how to make development faster.