adaptdl.checkpoint module

This module provides functionality to save and load arbitrary state as part of checkpoint-restart elasticity. The State class can be subclassed to define how to save and load any state to and from persistent storage, so that the state can be restored after the current job restarts and the job can resume from where it left off.

class adaptdl.checkpoint.State(name)[source]

Bases: object

An arbitrary piece of state that can be saved and loaded as part of a checkpoint and synchronized across all replicas. Subclass it to define custom save, load, and sync logic.

load(fileobj)[source]

Subclasses should override this method to define how the state is loaded. It is invoked by load_state to load the state from persistent storage.

Parameters

fileobj (BinaryIO) – A binary readable file object.

save(fileobj)[source]

Subclasses should override this method to define how the state is saved. It is invoked by save_all_states and save_state to save the state into persistent storage.

Parameters

fileobj (BinaryIO) – A binary writable file object.

sync()[source]

Subclasses should override this method to define how the state is synchronized across replicas. Synchronization may be necessary to make sure the state is consistent before it is saved to persistent storage. It is invoked by save_state before saving the state.
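
The three methods above can be sketched together in one subclass. To stay runnable without AdaptDL installed, the State class below is a local stand-in that only mirrors the documented interface; in a real job the base class would be adaptdl.checkpoint.State. CounterState, its value attribute, and the use of pickle are assumptions made for illustration.

```python
import io
import pickle


class State:
    """Local stand-in mirroring the documented interface (not the real class)."""

    def __init__(self, name):
        self.name = name

    def save(self, fileobj):
        pass

    def load(self, fileobj):
        pass

    def sync(self):
        pass


class CounterState(State):
    """Hypothetical state holding a single integer counter."""

    def __init__(self, name):
        super().__init__(name)
        self.value = 0

    def save(self, fileobj):
        # fileobj is a binary writable file object.
        pickle.dump({"value": self.value}, fileobj)

    def load(self, fileobj):
        # fileobj is a binary readable file object.
        self.value = pickle.load(fileobj)["value"]

    def sync(self):
        # Single-process sketch: there are no other replicas to reconcile with.
        pass


# Round-trip through an in-memory buffer standing in for persistent storage.
original = CounterState("counter")
original.value = 42
buffer = io.BytesIO()
original.sync()
original.save(buffer)

buffer.seek(0)
restored = CounterState("counter")
restored.load(buffer)
```

In a multi-replica job, sync would be the place to agree on a single value (for example, by reducing the counter across replicas) before rank 0 writes it out.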

adaptdl.checkpoint.load_state(state)[source]

Load the given State object from persistent storage. If the object was previously saved, then State.load will be invoked with a readable file object to load from.

Parameters

state (State) – State object to load from persistent storage.

Returns

True if state was previously saved and State.load was invoked, False otherwise.
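
The return value lets a caller tell a restart apart from a first run. The sketch below illustrates that pattern under stated assumptions: load_state_from is a hypothetical stand-in for load_state that reads from an explicit directory, and StepState is a hypothetical state class. How the real load_state locates the checkpoint is not described here.

```python
import os
import tempfile


class StepState:
    """Hypothetical state object with the documented save/load methods."""

    def __init__(self, name):
        self.name = name
        self.step = 0

    def save(self, fileobj):
        fileobj.write(str(self.step).encode("ascii"))

    def load(self, fileobj):
        self.step = int(fileobj.read().decode("ascii"))


def load_state_from(state, checkpoint_dir):
    """Stand-in for load_state: True only if a saved file was found and loaded."""
    path = os.path.join(checkpoint_dir, state.name)
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        state.load(f)
    return True


with tempfile.TemporaryDirectory() as ckpt_dir:
    # First run: nothing has been saved yet, so the default step is kept.
    first = StepState("step")
    found_first = load_state_from(first, ckpt_dir)

    # Simulate a checkpoint written before a restart.
    first.step = 7
    with open(os.path.join(ckpt_dir, first.name), "wb") as f:
        first.save(f)

    # After the restart: the saved value is restored.
    resumed = StepState("step")
    found_resumed = load_state_from(resumed, ckpt_dir)
```

A caller would typically branch on the result, initializing fresh state only when it is False.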

adaptdl.checkpoint.save_all_states()[source]

Invokes save_state on all State objects for which State.skip is True. This function can be used to trigger a global checkpoint and save every State in the current job.

adaptdl.checkpoint.save_state(state, checkpoint_dir, sync=True)[source]

Saves a State object to persistent storage. First invokes State.sync on all replicas if sync is True (the default), then invokes State.save on the replica of rank 0 only. The state is first written to a temporary folder, which is renamed to the final checkpoint folder after all states are saved.

Parameters
  • state (State) – The State object to save to persistent storage.

  • checkpoint_dir – The directory in which the state is saved.

  • sync (bool) – Whether State.sync should be invoked.
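
Writing to a temporary folder and renaming it once everything is saved keeps a partially written checkpoint from being mistaken for a complete one. The following is a minimal sketch of that general pattern, not the library's implementation; save_all_atomically, payloads, and the "_checkpoint-" prefix are hypothetical names.

```python
import os
import tempfile


def save_all_atomically(payloads, checkpoint_dir):
    """Write every payload into a temporary folder, then rename it into place.

    payloads maps a file name to the bytes to store under that name.
    checkpoint_dir must not exist yet.
    """
    parent = os.path.dirname(os.path.abspath(checkpoint_dir))
    # Create the temporary folder next to the target so the rename stays
    # on the same filesystem.
    tmp_dir = tempfile.mkdtemp(prefix="_checkpoint-", dir=parent)
    for name, data in payloads.items():
        with open(os.path.join(tmp_dir, name), "wb") as f:
            f.write(data)
    # Rename only after all files are written, so readers never see a
    # half-finished checkpoint folder.
    os.rename(tmp_dir, checkpoint_dir)


with tempfile.TemporaryDirectory() as root:
    final_dir = os.path.join(root, "checkpoint-0")
    save_all_atomically({"model": b"weights", "step": b"7"}, final_dir)
    saved = sorted(os.listdir(final_dir))
    leftovers = [d for d in os.listdir(root) if d.startswith("_checkpoint-")]
```

If the process dies before the rename, only the temporary folder is left behind and the previous checkpoint, if any, remains untouched.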