Databricks has an experimental feature that lets you customize your runtime using Docker containers. In this article, we show how you can quickly manage your Java/Scala dependencies when building a Docker container for data science purposes.
To start off, we will build our Docker container on top of the databricksruntime/standard image. Our Dockerfile looks like the following.
FROM databricksruntime/standard:latest
LABEL maintainer="One-Off Coder <info@oneoffcoder.com>"
# update ubuntu
RUN apt-get update \
    && apt-get install -y \
        build-essential \
        python3-dev \
    && apt-get clean
# update conda
RUN /databricks/conda/bin/conda update -n base -c defaults conda
COPY environment.yml /tmp/environment.yml
RUN /databricks/conda/bin/conda env update --file /tmp/environment.yml
# install maven
RUN wget -q http://mirror.metrocast.net/apache/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz -O /tmp/maven.tar.gz \
    && tar xvfz /tmp/maven.tar.gz -C /opt \
    && ln -s /opt/apache-maven-3.6.1 /opt/maven
# install jars
COPY pom.xml /tmp/pom.xml
RUN cd /tmp \
    && /opt/maven/bin/mvn dependency:copy-dependencies -DoutputDirectory=/databricks/jars
# clean up
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
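The environment.yml copied into the image is how conda manages the Python side of the dependencies. A minimal sketch of such a file might look like the following; the package list is illustrative, not necessarily what the published image uses, and the name field should match the conda environment shipped in the base image.
# illustrative environment.yml; adjust the name and packages to your needs
name: base
channels:
  - defaults
dependencies:
  - pandas
  - scikit-learn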
Note the importance of using Maven to manage the direct and transitive dependencies. Using Maven with a pom.xml, we can download all the dependencies we need and place the artifacts (jar files) in the /databricks/jars directory.
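For reference, a minimal pom.xml might look like the following. The project coordinates (com.oneoffcoder:databricks-deps) and the commons-math3 dependency are placeholders; declare one dependency per library your notebooks need, and Maven will pull in the transitive dependencies automatically.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.oneoffcoder</groupId>
  <artifactId>databricks-deps</artifactId>
  <version>1.0.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <!-- placeholder; add one <dependency> block per library you need -->
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.6.1</version>
    </dependency>
  </dependencies>
</project>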
That’s it! Now you may build the container as follows.
docker build --no-cache -t databricks-java:local .
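For Databricks to pull the image at cluster-creation time, it must live in a registry that Databricks can reach. Pushing to Docker Hub might look like this; the repository name oneoffcoder/databricks-java is illustrative.
docker tag databricks-java:local oneoffcoder/databricks-java:latest
docker push oneoffcoder/databricks-java:latest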
To use this image in Databricks, you have to ask for the custom containers feature to be enabled on your workspace. Whether you are on AWS or Azure, you will then be able to use your customized data science container. The full source code is available on GitHub and the container is published on Docker Hub.
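Once a cluster is running on the custom image, a quick smoke test from a notebook confirms the jars are on the classpath. A minimal sketch in Scala, assuming the commons-math3 placeholder from the pom.xml sketch above:
// Run in a Databricks notebook on a cluster built from the custom image.
// Jars copied into /databricks/jars are available on the classpath.
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

val stats = new DescriptiveStatistics()
Seq(1.0, 2.0, 3.0).foreach(stats.addValue)
println(stats.getMean) // prints 2.0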