adaptdl.env module

This module contains functions for retrieving the values of AdaptDL environment variables, or their defaults if unset.

adaptdl.env.adaptdl_sched_version()[source]

A string giving the version of the AdaptDL scheduler. Determined by the environment variable ADAPTDL_SCHED_VERSION, or None if unset.

Returns

version of the AdaptDL scheduler, or None.

Return type

str

adaptdl.env.checkpoint_path()[source]

Path to the directory used for saving and loading checkpoints. Determined by the environment variable ADAPTDL_CHECKPOINT_PATH, or None if unset. This environment variable must be set for checkpointing to work, and is set automatically in AdaptDL-scheduled clusters.

Returns

checkpoint path or None.

Return type

str
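Because the checkpoint path may be None, callers generally need to handle the unset case before building file paths. The following is a minimal sketch of that pattern, not AdaptDL's own code: checkpoint_dir() is a stand-in that reads ADAPTDL_CHECKPOINT_PATH directly so the example runs without AdaptDL installed, and checkpoint_file() is a hypothetical helper.

```python
import os


def checkpoint_dir(environ=os.environ):
    """Stand-in for adaptdl.env.checkpoint_path(): the value, or None if unset."""
    return environ.get("ADAPTDL_CHECKPOINT_PATH")


def checkpoint_file(name, environ=os.environ):
    """Hypothetical helper: full path for a checkpoint file, or None when
    ADAPTDL_CHECKPOINT_PATH is unset (checkpointing is then unavailable)."""
    root = checkpoint_dir(environ)
    if root is None:
        return None
    return os.path.join(root, name)
```

A caller can treat a None result as "skip saving and loading", which keeps standalone runs working without a checkpoint directory.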

adaptdl.env.from_ray()[source]

Returns True if the code is being called from Ray.

adaptdl.env.job_id()[source]

A string which uniquely identifies the current job in an AdaptDL-scheduled cluster. None if running standalone.

Returns

unique job identifier or None.

Return type

str

adaptdl.env.master_addr()[source]

Network address of the rank 0 replica, required for distributed training. Determined by the environment variable ADAPTDL_MASTER_ADDR, or 0.0.0.0 if unset.

In AdaptDL-scheduled clusters, this environment variable is unset. The rank 0 replica is discovered dynamically by querying the supervisor (supervisor_url()).

Returns

address of the rank 0 replica, or 0.0.0.0.

Return type

str

adaptdl.env.master_port()[source]

Available port for the rank 0 replica, required for distributed training. Determined by the environment variable ADAPTDL_MASTER_PORT, or 0 if unset. Automatically set in AdaptDL-scheduled clusters.

Returns

available port for the rank 0 replica, or 0.

Return type

int
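The documented defaults for the rank 0 address and port can be illustrated with a short sketch. It reads the two environment variables directly rather than calling AdaptDL, and master_endpoint() is a hypothetical name introduced only for this example.

```python
import os


def master_endpoint(environ=os.environ):
    """Hypothetical helper: (address, port) of the rank 0 replica.

    Falls back to the documented defaults, "0.0.0.0" and 0, when
    ADAPTDL_MASTER_ADDR or ADAPTDL_MASTER_PORT is unset.
    """
    addr = environ.get("ADAPTDL_MASTER_ADDR", "0.0.0.0")
    # Environment values are strings, so the port is converted to int.
    port = int(environ.get("ADAPTDL_MASTER_PORT", "0"))
    return addr, port
```

Note that this sketch covers only the environment-variable case; in AdaptDL-scheduled clusters the address is discovered through the supervisor instead.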

adaptdl.env.num_nodes()[source]

Number of unique nodes being used for the current job. For example, if there are 4 nodes, each running 2 replicas, then this function returns 4. Determined by the environment variable ADAPTDL_NUM_NODES, or equal to num_replicas() if unset. This environment variable therefore only needs to be set if at least one node runs multiple replicas. Automatically set in AdaptDL-scheduled clusters.

Returns

number of unique nodes, or the value of num_replicas().

Return type

int

adaptdl.env.num_replicas()[source]

Total number of replicas, required for distributed training. For example, if there are 4 nodes, each running 2 replicas, then this function returns 8. Determined by the environment variable ADAPTDL_NUM_REPLICAS, or 1 if unset. Automatically set in AdaptDL-scheduled clusters.

Returns

total number of replicas, or 1.

Return type

int
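The relationship between the node count and the replica count, including the fallback when ADAPTDL_NUM_NODES is unset, can be sketched as follows. These two functions are stand-ins that mirror the documented behavior by reading the environment directly; they are an assumption about the logic, not a copy of AdaptDL's source.

```python
import os


def num_replicas(environ=os.environ):
    """Stand-in: total replicas from ADAPTDL_NUM_REPLICAS, or 1 if unset."""
    return int(environ.get("ADAPTDL_NUM_REPLICAS", "1"))


def num_nodes(environ=os.environ):
    """Stand-in: nodes from ADAPTDL_NUM_NODES, or num_replicas() if unset."""
    value = environ.get("ADAPTDL_NUM_NODES")
    if value is None:
        # One replica per node is assumed when the node count is not given.
        return num_replicas(environ)
    return int(value)


# 4 nodes, each running 2 replicas, as in the examples above.
cluster = {"ADAPTDL_NUM_NODES": "4", "ADAPTDL_NUM_REPLICAS": "8"}
replicas_per_node = num_replicas(cluster) // num_nodes(cluster)
```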

adaptdl.env.num_restarts()[source]

Number of times the current job was restarted. Determined by the environment variable ADAPTDL_NUM_RESTARTS, or 0 if unset. This value is mainly informational, and is automatically set in AdaptDL-scheduled clusters.

Returns

number of restarts, or 0.

Return type

int

adaptdl.env.replica_rank()[source]

Rank of the current replica, required for distributed training. Each replica is assigned a unique rank from 0 to K-1, where K is the total number of replicas. Determined by the environment variable ADAPTDL_REPLICA_RANK, or 0 if unset. Automatically set in AdaptDL-scheduled clusters.

Returns

rank of the current replica, or 0.

Return type

int
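A common use of the replica rank is to restrict work such as logging to rank 0. The sketch below also checks the stated invariant that ranks run from 0 to K-1. Both function names are hypothetical, and the environment is read directly so the example runs without AdaptDL installed.

```python
import os


def rank_and_size(environ=os.environ):
    """Hypothetical helper: (rank, K) with the documented defaults 0 and 1.

    Raises ValueError if the rank is outside 0..K-1.
    """
    rank = int(environ.get("ADAPTDL_REPLICA_RANK", "0"))
    size = int(environ.get("ADAPTDL_NUM_REPLICAS", "1"))
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is outside 0..{size - 1}")
    return rank, size


def is_rank_zero(environ=os.environ):
    """True only on the rank 0 replica, e.g. to log from a single process."""
    return rank_and_size(environ)[0] == 0
```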

adaptdl.env.share_path()[source]

Path to a directory shared by all AdaptDL job replicas, which can be used by the application, e.g. for storing downloaded datasets or artifacts. Determined by the environment variable ADAPTDL_SHARE_PATH, or None if unset. Automatically set in AdaptDL-scheduled clusters.

Returns

shared directory path or None.

Return type

str
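As an illustration of using the shared directory for datasets, the sketch below picks a dataset location under ADAPTDL_SHARE_PATH and falls back to a local directory when the variable is unset. The helper name dataset_dir() and the "data" fallback are assumptions made for this example only.

```python
import os


def dataset_dir(name, fallback="data", environ=os.environ):
    """Hypothetical helper: directory for a named dataset.

    Uses the shared directory when ADAPTDL_SHARE_PATH is set, so all
    replicas see the same files; otherwise uses a local fallback.
    """
    root = environ.get("ADAPTDL_SHARE_PATH")
    if root is None:
        root = fallback
    return os.path.join(root, name)
```

When the shared path is in use, it is usually worth letting a single replica perform the download so that replicas do not write the same files concurrently.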

adaptdl.env.supervisor_url()[source]

URL of the supervisor in an AdaptDL-scheduled cluster. The address of the rank 0 replica is dynamically discovered via the supervisor, instead of via the ADAPTDL_MASTER_ADDR environment variable.

Returns

URL of the supervisor, or None.

Return type

str