
Customized data science runtime containers for Databricks

Databricks has an experimental feature that lets you customize your runtime using Docker containers. In this article, we show how to quickly build a Docker container for data science purposes. In particular, we will create a Docker container for use with Databricks that bundles several Natural Language Processing (NLP) packages.

To start, we will build our Docker container on top of the databricksruntime/standard image. This image uses Miniconda to manage environments and dependencies. The Miniconda environment it provides is called dcs-minimal, and we will install all of our required NLP dependencies into this environment. Below is the environment.yml file that declares which packages will be installed.

name: dcs-minimal
channels:
  - defaults
  - anaconda
dependencies:
  - python=3.7.3
  - six=1.12.0
  - nomkl=3
  - ipython=7.4.0
  - numpy=1.16.2
  - pandas=0.24.2
  - pip
  - pip:
    - pyarrow==0.13.0
    - azure==3.0.0
    - scipy
    - scikit-learn
    - spacy
    - nltk
    - gensim
    - textblob
    - allennlp
    - seaborn
    - flashtext

Note that we have declared the following NLP packages: spaCy, NLTK, gensim, TextBlob, AllenNLP, and FlashText.

The next thing to do is define our Dockerfile, which looks like the following.

FROM databricksruntime/standard:latest
LABEL maintainer="One-Off Coder <info@oneoffcoder.com>"

# updates and installs ubuntu packages
RUN apt-get update \
    && apt-get install -y \
        build-essential \
        python3-dev \
    && apt-get clean

# updates conda itself
RUN /databricks/conda/bin/conda update -y -n base -c defaults conda
# copies over the environment.yml file
COPY environment.yml /tmp/environment.yml
# updates the environment, dcs-minimal
RUN /databricks/conda/bin/conda env update --file /tmp/environment.yml
# downloads spaCy's English language model
# and all NLTK corpora and models
RUN /databricks/conda/envs/dcs-minimal/bin/python -m spacy download en_core_web_lg \
    && /databricks/conda/envs/dcs-minimal/bin/python -m nltk.downloader all

# clean up
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

That’s it! Now you can build the image as follows.

docker build --no-cache -t databricks-nlp:local .
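Before deploying, you may want to sanity-check the image locally. Below is a minimal smoke-test sketch; it assumes the databricks-nlp:local tag from the build command above, imports the NLP packages inside the dcs-minimal environment, and skips itself when no Docker daemon is available.

```shell
#!/bin/sh
# Smoke-test the freshly built image (requires a local Docker daemon).
# The databricks-nlp:local tag comes from the docker build command above.
if command -v docker >/dev/null 2>&1; then
    docker run --rm databricks-nlp:local \
        /databricks/conda/envs/dcs-minimal/bin/python \
        -c "import spacy, nltk, gensim, textblob, allennlp, flashtext; print('ok')"
else
    echo "docker not found; skipping smoke test"
fi
```

If the run prints ok, the environment inside the image resolved all the NLP dependencies correctly.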

To use this image on Databricks, you have to ask for the Customized Containers feature to be enabled on your workspace. Whether you are on AWS or Azure, you will then be able to launch clusters with your customized data science container. The full source code is available on GitHub and the container is published on Docker Hub.
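Once the feature is enabled, the image is referenced through the docker_image field of the Databricks Clusters API when creating a cluster. The sketch below shows roughly what such a request body might look like; the cluster name, Spark version, node type, image URL, and credentials are all placeholders, not values from this article.

```json
{
  "cluster_name": "nlp-cluster",
  "spark_version": "5.5.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "your-dockerhub-user/databricks-nlp:latest",
    "basic_auth": {
      "username": "your-dockerhub-user",
      "password": "your-registry-token"
    }
  }
}
```

The basic_auth block is only needed when the image lives in a private registry; for a public Docker Hub repository the url alone should suffice.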