adaptdl.checkpoint module
This module provides functionality to save and load arbitrary state as part of checkpoint-restart elasticity. The State class can be subclassed to define how to save/load any state to/from persistent storage, so that the state can be restored after the current job restarts and the job can resume from where it left off.
- class adaptdl.checkpoint.State(name)
Bases: object
An arbitrary piece of state which can be saved and loaded as part of a checkpoint, and synchronized across all replicas. It should be subclassed to define custom save, load, and sync logic.
- load(fileobj)
This method should be overridden by subclasses to define how the state is loaded. It is invoked by load_state to load the state from persistent storage.
- Parameters
fileobj (BinaryIO) – A binary readable file object.
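As a rough sketch of the subclassing pattern, the example below keeps a step counter and serializes it with pickle. To stay runnable without AdaptDL installed, it defines a stand-in State base class; with the real library one would presumably subclass adaptdl.checkpoint.State instead. The CounterState name, the save(fileobj) signature (assumed to mirror load), and the choice of pickle are illustrative assumptions, not part of the API documented above.

```python
import io
import pickle


class State:
    """Stand-in for adaptdl.checkpoint.State (assumption: defined here only
    so that this sketch runs without AdaptDL installed)."""

    def __init__(self, name):
        self.name = name

    def save(self, fileobj):
        pass

    def load(self, fileobj):
        pass


class CounterState(State):
    """Hypothetical state holding a single training-step counter."""

    def __init__(self, name):
        super().__init__(name)
        self.step = 0

    def save(self, fileobj):
        # Assumed to mirror load(): receives a binary writable file object.
        pickle.dump({"step": self.step}, fileobj)

    def load(self, fileobj):
        # fileobj is a binary readable file object, as documented above.
        self.step = pickle.load(fileobj)["step"]


# Round-trip through an in-memory buffer in place of persistent storage.
before = CounterState("counter")
before.step = 42
buffer = io.BytesIO()
before.save(buffer)
buffer.seek(0)

after = CounterState("counter-restored")
after.load(buffer)
```

Keeping save and load symmetric (the same dictionary layout on both sides) makes it harder for the two methods to drift apart as more fields are added.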
- adaptdl.checkpoint.load_state(state)
Load the given State object from persistent storage. If the object was previously saved, then State.load will be invoked with a readable file object to load from.
- Parameters
state (State) – State object to load from persistent storage.
- Returns
True if state was previously saved and State.load was invoked, False otherwise.
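The return value makes it possible to tell a fresh start apart from a restart. The sketch below imitates that contract with a stand-in load_state that looks for a file named after the state inside a directory. The file layout, the extra checkpoint_dir argument of the stand-in, and the EpochState class are assumptions made for illustration; they do not describe how AdaptDL itself locates checkpoints.

```python
import os
import pickle
import tempfile


class EpochState:
    """Hypothetical state; stands in for a subclass of State."""

    def __init__(self, name):
        self.name = name
        self.epoch = 0

    def load(self, fileobj):
        self.epoch = pickle.load(fileobj)


def load_state(state, checkpoint_dir):
    """Stand-in following the documented contract: invoke state.load and
    return True if a saved copy exists, otherwise return False."""
    path = os.path.join(checkpoint_dir, state.name)
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        state.load(f)
    return True


with tempfile.TemporaryDirectory() as ckpt_dir:
    state = EpochState("epoch")

    # First run: nothing has been saved, so the state keeps its default.
    restored_first = load_state(state, ckpt_dir)

    # Simulate a checkpoint written by an earlier incarnation of the job.
    with open(os.path.join(ckpt_dir, "epoch"), "wb") as f:
        pickle.dump(7, f)

    # After a restart: the saved value is loaded into the state.
    restored_second = load_state(state, ckpt_dir)
```

A caller can branch on the result, for example initializing from scratch only when False is returned.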
- adaptdl.checkpoint.save_all_states()
Invokes save_state on all State objects for which State.skip is True. This function can be used to trigger a global checkpoint and save every State in the current job.
- adaptdl.checkpoint.save_state(state, checkpoint_dir, sync=True)
Saves a State object to persistent storage. First invokes State.sync on all replicas if sync is True (default), and then invokes State.save on the replica of rank 0 only. The state is first written to a temporary folder, which is renamed to the final checkpoint folder after all states have been saved.
- Parameters
state (State) – The State object to save to persistent storage.
sync (bool) – Whether State.sync should be invoked.
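Writing to a temporary folder and renaming it afterwards is a common way to avoid leaving a half-written checkpoint behind. The following is a minimal sketch of that general technique using only the standard library. The function name write_checkpoint, the "tmp-ckpt-" prefix, and the single-process setting are assumptions, and AdaptDL's actual implementation may differ.

```python
import os
import pickle
import shutil
import tempfile


def write_checkpoint(states, checkpoint_dir):
    """Write every (name -> picklable value) pair into a temporary folder,
    then rename that folder to checkpoint_dir once all writes succeed."""
    parent = os.path.dirname(os.path.abspath(checkpoint_dir))
    tmp_dir = tempfile.mkdtemp(prefix="tmp-ckpt-", dir=parent)
    try:
        for name, value in states.items():
            with open(os.path.join(tmp_dir, name), "wb") as f:
                pickle.dump(value, f)
        # Drop any older checkpoint, then publish the new one. Note that
        # there is a brief window between these two calls in which no
        # checkpoint folder exists.
        if os.path.isdir(checkpoint_dir):
            shutil.rmtree(checkpoint_dir)
        os.rename(tmp_dir, checkpoint_dir)
    except BaseException:
        # Clean up the partial temporary folder and re-raise.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "checkpoint")
    write_checkpoint({"epoch": 3, "lr": 0.1}, target)

    saved_files = sorted(os.listdir(target))
    with open(os.path.join(target, "epoch"), "rb") as f:
        saved_epoch = pickle.load(f)
    leftovers = [d for d in os.listdir(root) if d.startswith("tmp-ckpt-")]
```

Creating the temporary folder next to the target keeps both on the same filesystem, so the final rename does not have to copy data.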