adaptdl.collective module
This module contains simple collective communication primitives which operate on arbitrary Python objects. It is meant to be general rather than fast. Only use these primitives to synchronize small objects which can be efficiently pickled and operated on. For larger objects, use framework-specific functions, such as those provided by torch.distributed.
The functions in this module should be invoked in the same order across all replicas in the current job. Otherwise, their behavior is undefined and you may encounter unexpected bugs and errors.
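The ordering requirement above can be illustrated without a running job. The following is a minimal sketch that uses hypothetical stand-in functions (they only record call names and are not part of adaptdl) to show how a rank-dependent branch makes replicas issue different call sequences, and how the same logic can be written so that every replica makes identical calls.

```python
def call_sequence(rank, guard_by_rank):
    """Return the ordered list of collective calls one replica would make."""
    calls = []

    # Hypothetical stand-ins: they only record which collective was called.
    def allreduce(value):
        calls.append("allreduce")
        return value

    def broadcast(value):
        calls.append("broadcast")
        return value

    if guard_by_rank:
        # Anti-pattern: only rank 0 reaches broadcast, so sequences diverge.
        if rank == 0:
            broadcast("config")
        allreduce(1.0)
    else:
        # Every replica issues the same calls in the same order;
        # only the argument differs by rank.
        broadcast("config" if rank == 0 else None)
        allreduce(1.0)
    return calls


bad = [call_sequence(rank, guard_by_rank=True) for rank in range(3)]
good = [call_sequence(rank, guard_by_rank=False) for rank in range(3)]
```

In the guarded variant, rank 0 issues a broadcast that the other replicas never match, which is the situation the warning above describes.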
- adaptdl.collective.allreduce(value, reduce_fn=<function default_reduce_fn>)
Reduces a value across all replicas in such a way that they all get the final result. Blocks until this function is invoked by all replicas.
- Parameters
value (object) – The object which will be reduced together with all other replicas.
reduce_fn (Function) – A reduction function which takes two objects as arguments, and returns the resulting reduced object.
- Returns
Resulting value after being reduced across all replicas.
- Return type
object
- Raises
RuntimeError – If this module has not been initialized.
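As an illustration of a custom reduce_fn, here is a minimal sketch of a two-argument function that merges small dictionaries of counters. The name merge_counts and the sample data are hypothetical, and the cross-replica reduction is simulated locally with functools.reduce so the snippet runs without a job. The order in which replicas are combined is not stated here, so the sketch assumes the function should be associative and commutative.

```python
from functools import reduce


def merge_counts(a, b):
    """Hypothetical reduce_fn: merge two counter dicts by summing values."""
    merged = dict(a)  # copy so neither input is mutated
    for key, count in b.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Local simulation of the values three replicas might contribute.
per_replica = [{"ok": 3}, {"ok": 1, "err": 2}, {"err": 1}]
combined = reduce(merge_counts, per_replica)

# In a real job, each replica would instead call (not executed here):
#   import adaptdl.collective
#   combined = adaptdl.collective.allreduce(local_counts, merge_counts)
```

Small dictionaries like these pickle cheaply, which matches the intended use of this module; large tensors are better handled by framework-specific collectives.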
- adaptdl.collective.allreduce_async(value, reduce_fn=<function default_reduce_fn>)
Asynchronous version of the allreduce function. Does not block; instead, it returns a future which can be used to obtain the result later.
- Parameters
value (object) – The object which will be reduced together with all other replicas.
reduce_fn (Function) – A reduction function which takes two objects as arguments, and returns the resulting reduced object.
- Returns
Object from which the result can be obtained later.
- Return type
- Raises
RuntimeError – If this module has not been initialized.
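The usual reason to prefer the asynchronous variant is to overlap the reduction with local work. The sketch below shows that pattern by analogy only: it uses concurrent.futures from the standard library and a hypothetical simulated_allreduce helper, and it assumes the returned future exposes a result() method in the style of concurrent.futures.Future, which is not confirmed by the text above.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def simulated_allreduce(values, reduce_fn):
    """Hypothetical stand-in: reduce all replicas' values in one process."""
    return reduce(reduce_fn, values)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the reduction without waiting for it to finish.
    future = pool.submit(simulated_allreduce, [1, 2, 3], lambda a, b: a + b)

    # Do unrelated local work while the reduction is in flight.
    local_work = sum(i * i for i in range(10))

    # Block only at the point where the reduced value is needed.
    total = future.result()
```

As with the blocking functions, every replica would still need to start its asynchronous reductions in the same order.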
- adaptdl.collective.broadcast(value)
Broadcasts a value from the replica of rank 0 to all replicas. Blocks until this function is invoked by all replicas.
- Parameters
value (object) – The object which will be broadcast from replica 0. Ignored on all other replicas.
- Returns
The value broadcast from replica 0.
- Return type
object
- Raises
RuntimeError – If this module has not been initialized.
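A common use of broadcast is to make all replicas agree on a value chosen by rank 0, such as a random seed. The following is a minimal local simulation, assuming a hypothetical simulated_broadcast helper in place of the real call; non-zero ranks pass None because their argument is ignored.

```python
import random


def simulated_broadcast(values_by_rank):
    """Hypothetical stand-in: every replica receives rank 0's value."""
    return [values_by_rank[0]] * len(values_by_rank)


num_replicas = 4

# Only rank 0 chooses a seed; the other replicas pass a placeholder.
inputs = [
    random.Random(0).randint(0, 2**31 - 1) if rank == 0 else None
    for rank in range(num_replicas)
]
seeds = simulated_broadcast(inputs)

# In a real job, each replica would instead call (not executed here):
#   import adaptdl.collective
#   seed = adaptdl.collective.broadcast(my_seed_or_none)
```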
- adaptdl.collective.initialize(master_addr=None, master_port=None, replica_rank=None, num_replicas=None)
Initializes this module. Must be invoked before calling any other function in this module. This function blocks until it has been invoked from all replicas.
- Parameters
master_addr – address of the replica with rank 0.
master_port – free port of the replica with rank 0.
replica_rank – rank of the current replica.
num_replicas – total number of replicas.
- Raises
RuntimeError – If this module has already been initialized.
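One way to supply these arguments explicitly is to collect them from the process environment before calling initialize. The sketch below is a hedged illustration: the DEMO_* variable names, the fallback values, and the choice of integer types for the port, rank, and replica count are all assumptions made for this example, not documented behavior.

```python
import os


def collective_kwargs(environ=os.environ):
    """Build keyword arguments for initialize() from hypothetical env vars."""
    return {
        "master_addr": environ.get("DEMO_MASTER_ADDR", "127.0.0.1"),
        "master_port": int(environ.get("DEMO_MASTER_PORT", "29500")),
        "replica_rank": int(environ.get("DEMO_REPLICA_RANK", "0")),
        "num_replicas": int(environ.get("DEMO_NUM_REPLICAS", "1")),
    }


# Example with an explicit mapping instead of the real environment.
kwargs = collective_kwargs({"DEMO_REPLICA_RANK": "2", "DEMO_NUM_REPLICAS": "4"})

# In a real job, each replica would call this exactly once (not executed here):
#   import adaptdl.collective
#   adaptdl.collective.initialize(**kwargs)
```

Because a second call raises RuntimeError, it is usually simplest to initialize once at program start, before any other function in this module is used.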