Provisioning EKS for AdaptDL

This page describes how to set up an AWS EKS cluster that auto-scales according to cluster load. Refer to other pages if you want to run AdaptDL on an existing Kubernetes cluster, or on a single node with MicroK8s.

Note

The instructions on this page assume eksctl, kubectl, helm and awscli are installed locally. You can follow this guide to install all the tools needed.
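A quick way to confirm the prerequisites is to check that each tool can be found on the PATH. The sketch below uses a hypothetical helper, missing_tools, which is not part of AdaptDL or any of the tools above. It assumes the AWS CLI is installed under its usual binary name, aws.

```shell
# missing_tools NAME...: print each command that cannot be found on PATH.
# Hypothetical helper for illustration; returns non-zero if anything is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# The four tools this page assumes (the awscli package provides the "aws" binary).
missing_tools eksctl kubectl helm aws || echo "install the tools listed above first"
```

If nothing is printed, all four tools were found.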

Attention

This guide provisions AWS resources that cost money. As of August 2020, you pay $0.10 per hour for each Amazon EKS cluster you create, $0.30 per GB-month for EFS storage, and $0.526 per hour for each g4dn.xlarge instance in use, starting with one. Because the cluster auto-scales, additional instances are launched only when needed, and you are charged only for the time they run.
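To get a rough sense of the bill, the prices quoted above can be combined into a simple estimate. The helper below, estimate_cost, is a hypothetical sketch: it assumes the August 2020 prices still apply, takes the total number of instance-hours as an input (since the autoscaler changes the node count over time), and ignores other charges such as data transfer.

```shell
# estimate_cost CLUSTER_HOURS INSTANCE_HOURS EFS_GB_MONTHS
# Prints an approximate bill in USD using the August 2020 prices quoted above.
# Hypothetical helper; real bills include items not modelled here.
estimate_cost() {
  awk -v c="$1" -v i="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", c * 0.10 + i * 0.526 + s * 0.30 }'
}

# Example: cluster up for 24 hours, 40 instance-hours in total,
# and 5 GB kept on EFS for one month. Prints 24.94.
estimate_cost 24 40 5
```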

Provisioning the Cluster

You may use the provided manifest to create the cluster. To change its configuration, download the file and modify it before running the command.

eksctl create cluster -f https://raw.githubusercontent.com/petuum/adaptdl/master/deploy/eks/adaptdl-eks-cluster-on-demand.yaml

This will provision an elastic EKS cluster named adaptdl-eks-cluster that scales between a minimum of 1 and a maximum of 4 nodes in the us-west-2 region. All nodes are on-demand g4dn.xlarge instances with a single GPU each. You can change the instance type and auto-scaling limits by changing nodeGroups.instanceType, nodeGroups.minSize, and nodeGroups.maxSize, respectively. You can also change the cluster name and the AWS region.
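One way to change the auto-scaling limit without editing the file by hand is to download the manifest and rewrite the field before passing the file to eksctl. The helper below, set_max_size, is hypothetical and assumes the manifest keeps maxSize on a line of its own (for example "    maxSize: 4"); inspect the downloaded file before relying on it.

```shell
# set_max_size FILE N: replace the value on every "maxSize:" line in FILE with N.
# Hypothetical helper; assumes the field sits on its own line in the manifest.
set_max_size() {
  sed -E "s/^([[:space:]]*maxSize:).*/\1 $2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Intended use (not run here, since it needs network access and AWS credentials):
#   curl -fsSLO https://raw.githubusercontent.com/petuum/adaptdl/master/deploy/eks/adaptdl-eks-cluster-on-demand.yaml
#   set_max_size adaptdl-eks-cluster-on-demand.yaml 8
#   eksctl create cluster -f adaptdl-eks-cluster-on-demand.yaml
```

Editing the file in a text editor works just as well; the point is only that eksctl accepts a local path in place of the URL.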

Make sure the CLUSTER_NAME and AWS_REGION environment variables reflect the correct values after this step, for example:

export CLUSTER_NAME=adaptdl-eks-cluster
export AWS_REGION=us-west-2
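
Later commands on this page expand these variables, and an unset variable silently becomes an empty string. A small guard such as the hypothetical require_env helper sketched below can catch that early. The values exported here are the defaults from the example above.

```shell
# require_env NAME...: fail if any named variable is unset or empty.
# Hypothetical helper for illustration only.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "environment variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

export CLUSTER_NAME=adaptdl-eks-cluster
export AWS_REGION=us-west-2
require_env CLUSTER_NAME AWS_REGION && echo "ready: $CLUSTER_NAME in $AWS_REGION"
```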

Provisioning EFS

AdaptDL depends on a distributed filesystem like EFS to save and load checkpoints during training. You may follow the instructions from this website to provision an EFS volume for your cluster.

Next, install the EFS provisioner Helm chart. Make sure you have set the FILE_SYSTEM_ID environment variable according to the linked instructions.

helm repo add stable https://charts.helm.sh/stable

helm repo update

helm install stable/efs-provisioner \
--set efsProvisioner.efsFileSystemId=$FILE_SYSTEM_ID \
--set efsProvisioner.awsRegion=$AWS_REGION \
--generate-name

Installing the Cluster Autoscaler

helm repo add autoscaler https://kubernetes.github.io/autoscaler

helm repo update

helm install autoscaler/cluster-autoscaler-chart \
--set autoDiscovery.clusterName=$CLUSTER_NAME \
--set awsRegion=$AWS_REGION \
--generate-name

To verify that cluster-autoscaler has started, run:

kubectl --namespace=default get pods -l "app.kubernetes.io/name=aws-cluster-autoscaler-chart"

The output should show the Cluster Autoscaler pod as Running.
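
In a setup script, the same check can be made without reading the output by eye. The sketch below wraps the command above with a JSONPath output format and tests the reported pod phases. The function names are hypothetical, and the label selector is copied from the verification command, so it only applies if that label matches your installed chart.

```shell
# autoscaler_phases: print the phase of every pod matching the label used above.
autoscaler_phases() {
  kubectl --namespace=default get pods \
    -l "app.kubernetes.io/name=aws-cluster-autoscaler-chart" \
    -o jsonpath='{.items[*].status.phase}'
}

# autoscaler_running: succeed only if at least one pod exists and all are Running.
autoscaler_running() {
  phases=$(autoscaler_phases) || return 1
  [ -n "$phases" ] || return 1
  for phase in $phases; do
    [ "$phase" = "Running" ] || return 1
  done
}

# Intended use (needs access to the cluster, so it is not run here):
#   until autoscaler_running; do sleep 5; done
```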

Installing the NVIDIA Plugin

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml
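
Once the device plugin is running, GPU nodes should advertise the nvidia.com/gpu resource as allocatable. The sketch below sums that value across nodes; total_gpus is a hypothetical helper, and it assumes the resource name nvidia.com/gpu and that kubectl's JSONPath accepts the escaped dot in the field name. With the default manifest and a single node, the expected total would be 1.

```shell
# total_gpus: sum the allocatable nvidia.com/gpu count across all nodes.
# Nodes without GPUs contribute an empty line, which awk counts as 0.
total_gpus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk '{ total += $1 } END { print total + 0 }'
}

# Intended use (needs access to the cluster, so it is not run here):
#   total_gpus
```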

(Optional) Registry Access

If you will be using AdaptDL’s insecure registry, you will need to add a new rule to the security group associated with the nodes of the cluster. You may need help from your AWS administrator to perform this step.

SECURITY_GROUP=$(aws cloudformation describe-stack-resources --stack-name \
eksctl-$CLUSTER_NAME-nodegroup-ng-1 --query \
'StackResources[?LogicalResourceId == `SG`].[PhysicalResourceId]' --output text)

aws ec2 authorize-security-group-ingress --group-id $SECURITY_GROUP \
--protocol tcp --port 32000 --cidr 0.0.0.0/0

Cleaning Up

Once you are done with the cluster, you can clean up all AWS resources with:

eksctl delete cluster --name $CLUSTER_NAME

for target in $(aws efs describe-mount-targets --file-system-id $FILE_SYSTEM_ID --query 'MountTargets[].MountTargetId' --output text); do
aws efs delete-mount-target --mount-target-id $target
done

aws efs delete-file-system --file-system-id $FILE_SYSTEM_ID
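
Mount target deletion is asynchronous, and EFS refuses to delete a file system that still has mount targets, so the delete-file-system command can fail if it runs too soon after the loop. One possible workaround is to poll until no mount targets remain. The function names below are hypothetical, and the polling interval is an arbitrary choice.

```shell
# mount_target_count FS_ID: number of mount targets still attached to the file system.
mount_target_count() {
  aws efs describe-mount-targets --file-system-id "$1" \
    --query 'length(MountTargets)' --output text
}

# wait_for_mount_targets FS_ID [INTERVAL_SECONDS]: poll until none remain.
# Returns non-zero if the AWS CLI call itself fails, instead of looping forever.
wait_for_mount_targets() {
  while :; do
    count=$(mount_target_count "$1") || return 1
    [ "$count" = "0" ] && return 0
    sleep "${2:-5}"
  done
}

# Intended use, between deleting the mount targets and deleting the file system:
#   wait_for_mount_targets "$FILE_SYSTEM_ID"
```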

Next Steps

Once your EKS cluster is provisioned and running, you can deploy the AdaptDL scheduler.