Submitting a Simple Job

This page is an introduction to running AdaptDL jobs using a simple “Hello, world!” program. The goal is to show the basics of creating and interacting with AdaptDL jobs. For an introduction to modifying existing PyTorch code to use AdaptDL, please see AdaptDL with PyTorch.

Installation

python3 -m pip install adaptdl-cli

Writing a Simple Program

For the purpose of this guide, you will want a simple python script that produces output to adaptdl.env.share_path(), the directory used for your job for storing general files.

For example, you may copy the following code (into hello_world/hello_world.py):

import adaptdl.env
import os
import time

print("Hello, world!")

with open(os.path.join(adaptdl.env.share_path(), "foo.txt"), "w") as f:
    f.write("Hello, world!")

time.sleep(100)

Please note that stdout is only accessible while a job is still running. Therefore, the time.sleep(100) call is important for this tutorial.

Writing a Dockerfile

In order to run your application code, the job containers need access to the code directly. A simple method is to create a docker image containing the application.

Currently the adaptdl cli requires you to be able to push to and the cluster to be able to pull from a docker registry. This may be dockerhub, or it may be your own private docker registry. Please ensure that that is set up before proceeding.

Copy the following docker file into hello_world/Dockerfile:

FROM python:3.7-slim
RUN python3 -m pip install adaptdl

COPY hello_world.py /root/hello_world.py

ENV PYTHONUNBUFFERED=true

Tip

If the Dockerfile is not written carefully, the Docker build step can take a long time. Make sure to follow the best practices when writing your Dockerfile so your builds are as fast as possible:

In particular, you should (almost) always have a .dockerignore file that contains .git and other large files/directories which are not used in your containers.

Configuring the Job

AdaptDL jobs are specified as Kuberenetes Resource. The following yaml file defines the job specification for your hello world application:

Example (in hello_world/adaptdljob.yaml):

apiVersion: adaptdl.petuum.com/v1
kind: AdaptDLJob
metadata:
  generateName: hello-world-
spec:
  template:
    spec:
      containers:
      - name: main
        command:
        - python3
        - /root/hello_world.py

Submitting the Job

Run the following AdaptDL cli command from your client.

adaptdl submit hello_world

Note

If you are using Docker for Mac with AdaptDL’s built-in insecure registry, the first run of adaptdl submit may fail with an error similar to:

Get https://host.docker.internal:59283/v2/: x509: certificate signed by unknown authority

You may need to restart Docker, and adaptdl submit should work thereafer.

This will create the AdaptDL Kubernetes job object for your application. Once this is created, the AdaptDL scheduler will recognize the job and schedule it for execution. Please note that for this command to work, the docker file created in step 3 must be located in hello_world/Dockerfile and the yaml created in step 4 must be located in hello_world/adaptdljob.yaml.

Monitoring the Job

Once the job object has been created, you can find more information about the job using

adaptdl ls

This should produce some output similar to

Name                                                             Status     Start(UTC)    Runtime  Rplc  Rtrt
hello-world-kgjsc                                                Running    Aug-24 18:47  1 min    1     0

Once the Status is listed as Running and not Pending, then the AdaptDL scheduler has created pods for your AdaptDL job. Use the following command to find out more details about the pods:

kubectl get pods

This should produce an output that looks like

NAME                                                         READY   STATUS     RESTARTS   AGE
adaptdl-adaptdl-sched-856cc685c4-hhdks                       3/3     Running    0          8h
hello-world-kgjsc-a7fe6b49-e673-11ea-a27e-061e69fb5c39-0-0   1/1     Running    0          20s

Note that this gets all of the pods in the default namespace, including the scheduler. To restrict this to just the pods created for your job, use kubectl get pods | grep hello-world.

When the phase is listed as Running, as opposed to ContainerCreating, then you can get the stdout and stderr logs via the following, (replacing <pod-name> with the name value you got from kubectl get pods):

kubectl logs <pod-name>

This should produce output of Hello, world!.

Please note that this method of getting stdout and stderr output requires the pod to still exist. However, when an AdaptDL job finishes or rescales, the worker pods are deleted. For more durable logging, it is advised to write to a file.

Retrieving Output Files

Use the following to copy result files to your client machine. Please replace <adaptdl-job> with the name value from the output of adaptdl ls in step 10:

adaptdl cp <adaptdl-job>:/adaptdl/share/foo.txt foo.txt

foo.txt on your local client should then contain hello world

Deleting the Job

Delete the job with kubectl: kubectl delete adaptdljob <adaptdl-job>. Again, replace the name parameter with the one from before. This will delete the AdaptDL kubernetes object from your job, which will also delete any running pods or other attached resources. Please note that this may cause files the job has written to to no longer be available.

(Advanced) External Registry

If possible, we recommend using a secure external Docker registry instead of the default insecure registry installed along with the AdaptDL scheduler. To do this, you’ll need to export two environment variables to let AdaptDL know the full reponame to use, say registry.example.com/adaptdl-submit, along with registry credentials mysecret. Refer to this website for how to create one.

export ADAPTDL_SUBMIT_REPO=registry.example.com/adaptdl-submit
export ADAPTDL_SUBMIT_REPO_CREDS=mysecret

Then do docker login in with the registry credentials.