DEV Community

David Haley

Container size analysis: TensorFlow 2.8 base image vs Deep Learning

TL;DR: building our DeepCell container from a base TensorFlow image makes it 50% faster to load and about 60% smaller than building it from the Deep Learning container.

                      Deep Learning image   Base TF image   Reduction
Uncompressed          19.5 GB               7.2 GB          63%
Compressed            8.4 GB                3.2 GB          62%
Batch job load time   6 min                 3 min           50%

This post covers how we rebuilt our container on the smaller base image, and why the Deep Learning container is so big to begin with. In short, you pay a steep price to have so many development tools available, and you typically don't need them for production tasks.

Optimizing our container

Our DeepCell journey began on Vertex AI. Google provides pre-built TensorFlow images as part of their Deep Learning Container Images.

These containers purport to let you:

Quickly prototype with a portable and consistent environment for developing, testing, and deploying your AI applications with Deep Learning Containers. These Docker images use popular frameworks and are performance optimized, compatibility tested, and ready to deploy.

Cool beans. Our DeepCell version uses TF2.8 so we picked this image from Google's list: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-gpu.2-8.py37

It runs Python 3.7, which fortunately is still supported by DeepCell. (I've had mixed experiences with Python version support across bioinformatics tools.)

Our initial container build was simple:

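# Google's Deep Learning Container for TensorFlow 2.8 (GPU, Python 3.7)
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-gpu.2-8.py37

# Cache-busting: this file changes whenever main gets a new commit,
# so the clone below is re-run instead of served from the build cache
ADD https://api.github.com/repos/dchaley/deepcell-imaging/git/refs/heads/main version.json

RUN git clone https://github.com/dchaley/deepcell-imaging.git

WORKDIR "/deepcell-imaging"

# Install DeepCell and friends into the user site-packages
RUN pip install --user --upgrade --quiet -r requirements.txt

ENTRYPOINT ["python", "benchmarking/deepcell-e2e/benchmark.py"]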

Our requirements file is pretty simple. We verified in the build logs that it didn't reinstall TensorFlow; note that the packages to install do not include TF:

Requirement already satisfied: tensorflow~=2.8.0 in /opt/conda/lib/python3.7/site-packages (from deepcell==0.12.9->-r requirements.txt (line 1)) (2.8.4)

...

Installing collected packages: tensorflow-addons, snakeviz, smart_open, qtpy, opencv-python-headless, lxml, jupyter-core, iniconfig, imagecodecs, cython, pytest, google-api-core, deepcell-toolbox, qtconsole, jupyter-console, deepcell-tracking, google-cloud-notebooks, google-cloud-bigquery, spektral, google-cloud-aiplatform, jupyter, deepcell

This resulted in a whopping ~20 GB container 😩

Screenshot of Docker desktop's image list

The compressed artifact size was ~8.4 GB: this is the amount of data that must be transferred before unpacking.

Screenshot of Artifact Registry showing the 8.4 GB container size.

The impact of all this? A six-minute start time for Google Batch jobs, measured from the start of the container download …

2024-04-30 14:56:20.896 PDT
gce: Pulling from deepcell-on-batch/deepcell-benchmarking-us-central1/benchmarking

… until executing the container:

2024-04-30 15:02:23.233 PDT
Executing runnable container:

I wasn't thrilled with a six-minute minimum feedback cycle 😤 We tried image streaming to reduce startup time but alas, the container was so large it couldn't run without provisioning additional boot disk space.
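As a sanity check on that six-minute figure, here's a small Python sketch that subtracts the two log timestamps quoted above. The timestamps are copied from the logs; the variable names are just for illustration.

```python
from datetime import datetime

# Timestamps copied from the two Batch log lines above (both PDT).
FMT = "%Y-%m-%d %H:%M:%S.%f"
pull_started = datetime.strptime("2024-04-30 14:56:20.896", FMT)
container_started = datetime.strptime("2024-04-30 15:02:23.233", FMT)

# Time between starting the image pull and executing the container.
load_time = container_started - pull_started
minutes, seconds = divmod(load_time.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.1f} s")
```

That works out to a little over six minutes, all of it spent before our code runs a single line.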

We figured we should be able to build a container from a slimmer TensorFlow base image. We knew the DeepCell team had done some work scaling DeepCell using Kubernetes on GKE. Their Dockerfile confirmed it: just use TF's image.

We switched our base to TF's, grabbed the apt maintenance work they did, and updated our Dockerfile [diff].

The result: 7.2 GB uncompressed and 3.2 GB compressed, and ~3 min from starting to fetch the container to beginning to execute it.

                      Deep Learning image   Base TF image   Reduction
Uncompressed          19.5 GB               7.2 GB          63%
Compressed            8.4 GB                3.2 GB          62%
Batch job load time   6 min                 3 min           50%
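For the record, the Reduction column is just (before - after) / before. A minimal Python sketch, using the figures from the table (the function and dictionary names are made up for this example):

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# (Deep Learning image, base TF image) figures from the table above.
measurements = {
    "Uncompressed (GB)": (19.5, 7.2),
    "Compressed (GB)": (8.4, 3.2),
    "Batch job load time (min)": (6, 3),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {reduction_percent(before, after):.0f}% reduction")
```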

That's better 😎 But I couldn't help but wonder … why?

Container size analysis

Let's dive into what's in the containers. The containers are too large to open in Cloud Shell 🫠 so we'll do it the old-fashioned way, locally.

Let's use ncdu to explore the file system.
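If you can't install ncdu in a container, a rough stand-in can be sketched in Python. This is not how ncdu works internally, just a minimal approximation: it sums apparent file sizes (like du --apparent-size, whereas ncdu shows disk usage by default), so the numbers will differ a little. The function names are made up for this example.

```python
import os


def tree_size(path: str, seen: set) -> int:
    """Total apparent bytes under `path`, counting each hard-linked file once."""
    total = 0
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file we already counted
            seen.add(key)
            total += st.st_size
    return total


def largest_children(root: str, top: int = 5) -> list:
    """Return (bytes, path) for the biggest entries directly under `root`."""
    seen = set()  # shared, so a file linked from two siblings is counted once
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size(entry.path, seen)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling largest_children("/usr") inside the container gives a ranking comparable to the listings in this post. Point it at a specific directory rather than /, since pseudo-filesystems like /proc report misleading apparent sizes.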

Deep Learning

This container was built from the Deep Learning base. Let's boot it up & install ncdu.

$ docker run -it --entrypoint bash us-central1-docker.pkg.dev/deepcell-on-batch/deepcell-benchmarking-us-central1/benchmarking@sha256:8cc9b89e5869a4d468d64810b2ae47e242cc106519b2b8d7c4a9daa07856bdde
root@55a486270459:/deepcell-imaging# apt update && apt install ncdu

Begin scanning the root directory:

root@55a486270459:/deepcell-imaging# ncdu /

It scans pretty quickly. Here's the summary:

Screenshot of the ncdu tool

So far this just tells us we have a lot in usr and opt (common places to install libraries). Let's start with usr.

    6.6 GiB [ 53.9%] /lib
    4.9 GiB [ 39.6%] /local
  363.3 MiB [  2.9%] /share
  276.8 MiB [  2.2%] /bin
  144.1 MiB [  1.1%] /src

A bit odd to have stuff in both lib and local, but let's see. lib is mostly cuDNN (the CUDA Deep Neural Network library) and TensorRT (libnvinfer):

--- /usr/lib -------------------------
                     /..
    5.5 GiB [ 83.2%] /x86_64-linux-gnu
  938.5 MiB [ 13.8%] /google-cloud-sdk



--- /usr/lib/x86_64-linux-gnu ----------------------------
                      /..
    1.4 GiB [ 24.5%]  libcudnn_static.a
  956.8 MiB [ 16.9%]  libnvinfer_builder_resource.so.8.6.1
  839.4 MiB [ 14.8%]  libcudnn_cnn_infer_static.a
  675.1 MiB [ 11.9%]  libcudnn_cnn_infer.so.8.2.0
  271.8 MiB [  4.8%]  libcudnn_ops_infer.so.8.2.0
  227.3 MiB [  4.0%]  libcudnn_cnn_train_static.a
  225.5 MiB [  4.0%]  libnvinfer.so.8.6.1

Static libraries (the .a files) are only needed when linking a program at build time, and we aren't building anything. We may well need the shared libraries (.so) for inference; I'm not sure. But the static libraries here add up to roughly 2.5 GB…
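Here's the arithmetic behind that static-library figure, as a quick Python sketch. The sizes are the ones ncdu displays above (so already rounded), and only the .a files visible in the listing are included.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Sizes of the *.a files as shown in the ncdu listing above.
static_libs = {
    "libcudnn_static.a": 1.4 * GIB,
    "libcudnn_cnn_infer_static.a": 839.4 * MIB,
    "libcudnn_cnn_train_static.a": 227.3 * MIB,
}

total_bytes = sum(static_libs.values())

# ncdu reports binary units (GiB); image sizes are usually quoted in decimal GB.
print(f"{total_bytes / GIB:.2f} GiB = {total_bytes / 1e9:.2f} GB")
```

So it's a bit under 2.5 in ncdu's GiB and a bit over in decimal GB; close enough for a tally of things we don't need.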

Surprising also to see a gig in the cloud sdk… it looks like the sdk ships its own Python distro and some other stuff.

--- /usr/lib/google-cloud-sdk --
                     /..
  382.3 MiB [ 40.7%] /lib
  296.7 MiB [ 31.6%] /platform
  169.5 MiB [ 18.1%] /bin

As for /usr/local:

--- /usr/local ----------------
                     /..
    3.4 GiB [ 70.0%] /cuda-11.3
  850.0 MiB [ 17.0%] /share
  603.9 MiB [ 12.1%] /cuda-12.2

Well… do we actually need 2 versions of CUDA? (Why is 12.2 so much smaller?) About half of the 11.3 version is static libraries again.

So far we're at ~4 GB of CUDA-related static libraries (which we don't need).

How about that /usr/local/share directory…

--- /usr/local/share/.cache --
                     /..
  850.0 MiB [100.0%] /yarn

A gig of yarn package caches 😑 That brings us to ~5 GB of stuff we don't need.

Alright, bouncing back to /opt (the other big directory, with 6 GB):

--- /opt --------------------
                     /..
    4.8 GiB [ 79.1%] /conda
    1.3 GiB [ 20.9%] /nvidia

Conda is a Python distribution; we'll come back to it. First, let's check out what's in nvidia:

--- /opt/nvidia --------------------
                     /..
    1.3 GiB [100.0%] /nsight-compute



--- /opt/nvidia/nsight-compute -----
                     /..
  651.5 MiB [ 50.0%] /2021.1.1
  651.3 MiB [ 50.0%] /2021.1.0

So we have half a gig on an old version. What is nsight anyhow?

NVIDIA Nsight™ Systems is a system-wide performance analysis tool

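Well, we don't need that … so we're at ~6 GB of stuff we don't need. (Strictly speaking, the directory is nsight-compute, i.e. Nsight Compute, NVIDIA's CUDA kernel profiler and a sibling of Nsight Systems; either way it's a profiling tool.) Let's go back to /opt/conda (~5 GB); as expected, most of it is in packages & libraries: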

--- /opt/conda -----------
                     /..
    4.5 GiB [ 94.0%] /pkgs
    3.2 GiB [ 67.0%] /lib

Most of the 4.5 GB of pkgs is in something called dlenv-tf-2-8-gpu-1.0.20230926-py37hab20f5e_0 which in turn is ~3 GB of libraries.

--- /opt/conda/pkgs/dlenv-tf-2-8-gpu-1.0.20230926-py37hab20f5e_0 -----
                     /..
    2.9 GiB [ 81.3%] /lib
  623.1 MiB [ 17.3%] /share

The libraries are Python 3.7 site-packages: mostly TensorFlow (1 GB) plus a bunch of smaller Python libraries. We presumably need this stuff!

--- /opt/conda/pkgs/dlenv-tf-2-8...e_0/lib/python3.7/site-packages ---
                     /..
    1.1 GiB [ 39.5%] /tensorflow
  282.5 MiB [  9.7%] /ray
  116.9 MiB [  4.0%] /pyarrow
   98.2 MiB [  3.4%] /llvmlite
   84.3 MiB [  2.9%] /scipy
   83.9 MiB [  2.9%] /sklearn
   78.8 MiB [  2.7%] /plotly
   69.5 MiB [  2.4%] /tensorflow_io
   58.6 MiB [  2.0%] /clang
   50.3 MiB [  1.7%] /apache_beam
   46.6 MiB [  1.6%] /google

How about share?

--- /opt/conda/pkgs/dlenv-tf-2-8...0.20230926-py37hab20f5e_0/share ---
                     /..
  621.3 MiB [ 99.7%] /jupyter

--- /opt/conda/pkgs/dlenv-tf-2-8...f5e_0/share/jupyter/lab/staging ---
                     /..
  480.7 MiB [ 88.5%] /node_modules
   57.1 MiB [ 10.5%] /build

Half a gig for Jupyter's JS dependencies & build files. So, ~6.5 GB of unused stuff.

How about lib, the sibling of pkgs (3.2 GB)? Almost all of it is … another Python distribution?

--- /opt/conda/lib/python3.7 ------
                     /..
    2.9 GiB [ 98.5%] /site-packages

--- /opt/conda/lib/python3.7/site-packages ---
                     /..
    1.1 GiB [ 38.3%] /tensorflow
  282.5 MiB [  9.4%] /ray
  117.0 MiB [  3.9%] /pyarrow
   98.2 MiB [  3.3%] /llvmlite
   84.3 MiB [  2.8%] /scipy
   83.9 MiB [  2.8%] /sklearn
   78.8 MiB [  2.6%] /plotly
   69.5 MiB [  2.3%] /tensorflow_io
   58.6 MiB [  1.9%] /clang
   50.8 MiB [  1.7%] /google
   50.3 MiB [  1.7%] /apache_beam

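These appear to be the same packages as in the dlenv package folder above… ~3 GB of duplication, bringing our unused total to ~9.5 GB. (One caveat: /opt/conda totals 4.8 GiB, yet pkgs and lib add up to 7.7 GiB, so some of these files are probably hard links rather than independent copies, and the real duplication may be smaller.)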
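Whether those ~3 GB are separate copies or hard links to the same data can be checked by comparing inodes. Conda typically hard-links files from its package cache into the environment when both live on the same filesystem, so this is worth ruling out. Below is a minimal sketch; the function name is made up, and treating same-sized files at the same relative path as copies is a simplifying assumption, not a content comparison.

```python
import os


def shared_file_bytes(tree_a: str, tree_b: str) -> tuple:
    """Compare files found at the same relative path in both trees.

    Returns (hard_linked_bytes, separate_bytes): bytes in files that are the
    same inode in both trees, and bytes in same-sized files stored separately.
    """
    hard_linked = separate = 0
    for dirpath, _dirnames, filenames in os.walk(tree_a):
        rel_dir = os.path.relpath(dirpath, tree_a)
        for name in filenames:
            path_a = os.path.join(dirpath, name)
            path_b = os.path.join(tree_b, rel_dir, name)
            # Skip symlinks and files that exist in only one tree.
            if os.path.islink(path_a) or os.path.islink(path_b):
                continue
            if not os.path.isfile(path_b):
                continue
            size_a = os.path.getsize(path_a)
            if os.path.samefile(path_a, path_b):
                hard_linked += size_a  # same device and inode: no extra space
            elif size_a == os.path.getsize(path_b):
                separate += size_a  # same size, different inode: likely a copy
    return hard_linked, separate
```

In the container you'd pass the dlenv site-packages directory under /opt/conda/pkgs as tree_a and /opt/conda/lib/python3.7/site-packages as tree_b. I haven't run this against the image, so I can't say which way it comes out.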

Since that accounts for most of our ~12 GB difference, I stopped here.
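To make the running total explicit, here's the tally as a short Python sketch. The increments are the rounded steps implied by the running totals in the walkthrough (~4, ~5, ~6, ~6.5, ~9.5 GB), not exact measurements, and the labels are mine.

```python
# Rounded contributions in GB, matching the running totals in the walkthrough.
unused_gb = {
    "CUDA static libraries": 4.0,
    "yarn cache": 1.0,                # 850 MiB, rounded up to "a gig"
    "Nsight Compute": 1.0,            # 1.3 GiB across two versions
    "Jupyter node_modules and build": 0.5,
    "duplicated site-packages": 3.0,  # may be overstated if hard-linked
}

total_unused = sum(unused_gb.values())

# Uncompressed image sizes from the table at the top of the post.
size_difference = 19.5 - 7.2

print(f"Accounted for: {total_unused:.1f} GB of {size_difference:.1f} GB "
      f"({total_unused / size_difference:.0%})")
```

That's roughly three quarters of the gap between the two images explained by tooling we never touch.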

Container size analysis: TensorFlow base

Let's do a quick scan of the container built off the base TensorFlow image.

Let's open up the container. Ooh, fancy...

Screenshot of the TensorFlow container showing ASCII art of the word TensorFlow

root@2317ea736b48:/deepcell-imaging# apt update && apt install ncdu
root@2317ea736b48:/deepcell-imaging# ncdu /

This time most of the contents are in usr and root:

--- / --------------------
    5.5 GiB [ 79.7%] /usr
    1.3 GiB [ 19.0%] /root

Most of root is Python 3.8 libraries: lots of small packages, presumably where pip install --user put our requirements:

--- /root/.local/lib/python3.8 ----
                     /..
    1.0 GiB [100.0%] /site-packages

--- /root/.local/lib/python3.8/site-packages ----
                     /..
   85.5 MiB [  8.5%] /scipy
   83.6 MiB [  8.3%] /google
   74.5 MiB [  7.4%] /imagecodecs
   72.3 MiB [  7.2%] /cv2
   62.1 MiB [  6.1%] /opencv_python_headless.libs
   61.9 MiB [  6.1%] /pandas
   45.7 MiB [  4.5%] /sklearn

whereas /usr looks like this:

--- /usr ------------------
                     /..
    3.1 GiB [ 57.2%] /local
    2.2 GiB [ 40.1%] /lib

Almost all of lib is cuDNN and TensorRT:

--- /usr/lib/x86_64-linux-gnu -------------------
                     /..
  757.3 MiB [ 36.9%]  libcudnn_cnn_infer.so.8.1.0
  442.8 MiB [ 21.6%]  libnvinfer.so.7.2.2
  267.4 MiB [ 13.0%]  libcudnn_ops_infer.so.8.1.0

whereas local is split between the CUDA toolkit and Python packages:

--- /usr/local ----------------
                     /..
    1.7 GiB [ 55.4%] /cuda-11.2
    1.4 GiB [ 43.7%] /lib

--- /usr/local/cuda-11.2/targets/x86_64-linux/lib ---
                     /..
  382.7 MiB [ 25.4%]  libcusolver.so.11.1.0.152
  219.6 MiB [ 14.6%]  libcusparse.so.11.4.1.1152
  186.6 MiB [ 12.4%]  libcusolverMg.so.11.1.0.152
  181.3 MiB [ 12.0%]  libcufft.so.10.4.1.152
  176.7 MiB [ 11.7%]  libcublasLt.so.11.4.1.1043

--- /usr/local/lib/python3.8/dist-packages ---
                     /..
    1.1 GiB [ 84.0%] /tensorflow

Note that the cuDNN and TensorRT libraries in /usr/lib are distinct from the CUDA toolkit libraries (cuSOLVER, cuSPARSE, cuFFT, cuBLAS) in /usr/local.

Conclusions

The Deep Learning container seems better suited for:

  • compiling tools from source
  • training, not just predicting
  • using notebooks for iterative development
  • overall development tasks

The TensorFlow base image seems better suited for:

  • running the specific thing you want to run once you've figured out how to run it.

Future work?

Google has optimized container images for Vertex AI. We'd use: us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest

I get the sense from the docs these only work on Vertex AI & need you to train the model on Vertex AI as well:

The optimization occurs when Vertex AI uploads a model, before it runs.

At some point it may be worth comparing the cost of predicting via Vertex AI online models with predicting from an open-source container on Batch. But if that container is just as large again because of training code, we may lose whatever benefits we gained…
