adaptdl.env module
This module contains functions for retrieving the values of AdaptDL environment variables, or their defaults if unset.
- adaptdl.env.adaptdl_sched_version()
A string giving the AdaptDL version of the scheduler. Determined by the environment variable ADAPTDL_SCHED_VERSION, or None if unset.
- Returns
AdaptDL version of the scheduler, or None.
- Return type
str
- adaptdl.env.checkpoint_path()
Path to the directory used for saving and loading checkpoints. Determined by the environment variable ADAPTDL_CHECKPOINT_PATH, or None if unset. This environment variable must be set for checkpointing to work, and it is set automatically in AdaptDL-scheduled clusters.
- Returns
checkpoint path, or None.
- Return type
str
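Because the checkpoint path may be None, application code usually needs a guard before building file names from it. The following is a minimal sketch of that guard, not AdaptDL's own implementation: it reads ADAPTDL_CHECKPOINT_PATH directly so that it runs without AdaptDL installed, and the helper name checkpoint_file and its environ parameter are hypothetical.

```python
import os
from typing import Mapping, Optional


def checkpoint_file(name: str,
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the full path for a checkpoint file, or None if no
    checkpoint directory is configured.

    Hypothetical helper: reads the variable documented above directly
    instead of calling adaptdl.env.checkpoint_path().
    """
    base = environ.get("ADAPTDL_CHECKPOINT_PATH")
    if base is None:
        # Checkpointing is unavailable without a configured directory.
        return None
    return os.path.join(base, name)
```

With AdaptDL installed, the environ lookup would presumably be replaced by a call to adaptdl.env.checkpoint_path().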
- adaptdl.env.job_id()
A string which uniquely identifies the current job in an AdaptDL-scheduled cluster, or None if running standalone.
- Returns
unique job identifier, or None.
- Return type
str
- adaptdl.env.master_addr()
Network address of the rank 0 replica, required for distributed training. Determined by the environment variable ADAPTDL_MASTER_ADDR, or 0.0.0.0 if unset. In AdaptDL-scheduled clusters, this environment variable is unset, and the rank 0 replica is instead discovered dynamically by querying the supervisor (supervisor_url()).
- Returns
address of the rank 0 replica, or 0.0.0.0.
- Return type
str
- adaptdl.env.master_port()
Available port for the rank 0 replica, required for distributed training. Determined by the environment variable ADAPTDL_MASTER_PORT, or 0 if unset. Automatically set in AdaptDL-scheduled clusters.
- Returns
available port for the rank 0 replica, or 0.
- Return type
int
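The documented defaults for the rank 0 address and port can be illustrated with a short sketch. It covers only the standalone case, where the address comes from the environment rather than from the supervisor. The function name master_endpoint and its environ parameter are hypothetical, and the code reads the variables directly instead of calling the adaptdl.env functions.

```python
import os
from typing import Mapping, Tuple


def master_endpoint(environ: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (address, port) of the rank 0 replica using the documented
    defaults: 0.0.0.0 for the address and 0 for the port.

    Hypothetical helper; it does not query the supervisor.
    """
    addr = environ.get("ADAPTDL_MASTER_ADDR", "0.0.0.0")
    # Environment values are strings, so the port is converted to int.
    port = int(environ.get("ADAPTDL_MASTER_PORT", "0"))
    return addr, port
```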
- adaptdl.env.num_nodes()
Number of unique nodes being used for the current job. For example, if there are 4 nodes, each running 2 replicas, then this function returns 4. Determined by the environment variable ADAPTDL_NUM_NODES, or equal to num_replicas() if unset. Thus, this environment variable only needs to be set if some node runs multiple replicas. Automatically set in AdaptDL-scheduled clusters.
- Returns
number of unique nodes, or the value of num_replicas().
- Return type
int
- adaptdl.env.num_replicas()
Total number of replicas, required for distributed training. For example, if there are 4 nodes, each running 2 replicas, then this function returns 8. Determined by the environment variable ADAPTDL_NUM_REPLICAS, or 1 if unset. Automatically set in AdaptDL-scheduled clusters.
- Returns
total number of replicas, or 1.
- Return type
int
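The fallback relationship between the node count and the replica count can be sketched as follows. This is an illustration of the documented defaults, not AdaptDL's source: the helper replica_layout and its environ parameter are hypothetical, and the variables are read directly so the example runs without AdaptDL installed.

```python
import os
from typing import Mapping, Tuple


def replica_layout(environ: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (num_nodes, num_replicas) using the documented defaults.

    num_replicas defaults to 1; num_nodes defaults to num_replicas,
    which corresponds to one replica per node.
    """
    num_replicas = int(environ.get("ADAPTDL_NUM_REPLICAS", "1"))
    num_nodes = int(environ.get("ADAPTDL_NUM_NODES", str(num_replicas)))
    return num_nodes, num_replicas


# The 4-node, 2-replicas-per-node example from the text above.
nodes, replicas = replica_layout(
    {"ADAPTDL_NUM_NODES": "4", "ADAPTDL_NUM_REPLICAS": "8"})
print(nodes, replicas)  # 4 8
```

Setting only ADAPTDL_NUM_REPLICAS to 8 would yield 8 nodes in this sketch, which is why ADAPTDL_NUM_NODES needs to be set explicitly when a node runs more than one replica.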
- adaptdl.env.num_restarts()
Number of times the current job was restarted. Determined by the environment variable ADAPTDL_NUM_RESTARTS, or 0 if unset. This value is mainly informational, and is automatically set in AdaptDL-scheduled clusters.
- Returns
number of restarts, or 0.
- Return type
int
- adaptdl.env.replica_rank()
Rank of the current replica, required for distributed training. Each replica is assigned a unique rank from 0 to K-1, where K is the total number of replicas. Determined by the environment variable ADAPTDL_REPLICA_RANK, or 0 if unset. Automatically set in AdaptDL-scheduled clusters.
- Returns
rank of the current replica, or 0.
- Return type
int
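One common use of a rank in the range 0 to K-1 is to give each replica a disjoint slice of some work list. The sketch below shows that idea only. It is not how AdaptDL partitions data internally, the helper shard_for_replica and its environ parameter are hypothetical, and the variables are read directly rather than through adaptdl.env.

```python
import os
from typing import List, Mapping, Sequence, TypeVar

T = TypeVar("T")


def shard_for_replica(items: Sequence[T],
                      environ: Mapping[str, str] = os.environ) -> List[T]:
    """Return the items assigned to the current replica by striding:
    replica r out of K takes items r, r + K, r + 2K, and so on.
    """
    rank = int(environ.get("ADAPTDL_REPLICA_RANK", "0"))
    num_replicas = int(environ.get("ADAPTDL_NUM_REPLICAS", "1"))
    if not 0 <= rank < num_replicas:
        raise ValueError(
            f"rank {rank} is outside the range 0..{num_replicas - 1}")
    return list(items)[rank::num_replicas]
```

With both variables unset, the defaults (rank 0 of 1 replica) make the function return every item, which matches standalone execution.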
- adaptdl.env.share_path()
Path to a directory shared by all AdaptDL job replicas, which can be used by the application, e.g. for storing downloaded datasets or artifacts. Determined by the environment variable ADAPTDL_SHARE_PATH, or None if unset. Automatically set in AdaptDL-scheduled clusters.
- Returns
shared directory path, or None.
- Return type
str
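Since the shared directory may be None outside a scheduled cluster, an application that stores datasets there needs a fallback location. The following sketch shows one possible choice, falling back to the system temporary directory. The fallback policy, the helper name dataset_dir, and its environ parameter are assumptions made for illustration and are not part of AdaptDL.

```python
import os
import tempfile
from typing import Mapping


def dataset_dir(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a directory path for a named dataset.

    Uses ADAPTDL_SHARE_PATH when it is set; otherwise falls back to the
    system temporary directory (an assumption of this sketch).
    """
    base = environ.get("ADAPTDL_SHARE_PATH")
    if base is None:
        base = tempfile.gettempdir()
    return os.path.join(base, name)
```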