adaptdl.collective module

This module contains simple collective communications primitives which operate on arbitrary python objects. It is meant to be general but non-performant. Only use these primitives if you are synchronizing small objects which can be efficiently pickled and operated on. For larger objects, use framework-specific functions, such as those provided by torch.distributed.

The functions in this module should be invoked in the same order across all replicas in the current job. Otherwise, their behavior is undefined and you may encounter unexpected bugs and errors.

adaptdl.collective.allreduce(value, reduce_fn=<function default_reduce_fn>)[source]

Reduces a value across all replicas in such a way that they all get the final result. Blocks until this function is invoked by all replicas.

Parameters
  • value (object) – The object which will be reduced together with all other replicas.

  • reduce_fn (Function) – A reduction function which two objects as arguments, and returns the resulting reduced object.

Returns

Resulting value after being reduced across all replicas.

Return type

object

Raises

RuntimeError – If this module has not been initialized.

adaptdl.collective.allreduce_async(value, reduce_fn=<function default_reduce_fn>)[source]

Asynchronous version of the allreduce function. Does not block, instead returns a future which can be used to obtain the result later.

Parameters
  • value (object) – The object which will be reduced together with all other replicas.

  • reduce_fn (Function) – A reduction function which two objects as arguments, and returns the resulting reduced object.

Returns

Object from which the result can be obtained later.

Return type

Future

Raises

RuntimeError – If this module has not been initialized.

adaptdl.collective.broadcast(value)[source]

Broadcasts a value from the replica of rank 0 to all replicas. Blocks until this function is invoked by all replicas.

Parameters

value (object) – The object which will be broadcasted from replica 0. Ignored on all other replicas.

Returns

The value broadcasted from replica 0.

Return type

object

Raises

RuntimeError – If this module has not been initialized.

adaptdl.collective.initialize(master_addr=None, master_port=None, replica_rank=None, num_replicas=None)[source]

Initialize this module, must be invoked before calling any other functions. This function will block until it has been invoked from all replicas.

Parameters
  • master_addr – address of the replica with rank 0.

  • master_port – free port of the replica with rank 0.

  • replica_rank – rank of the current replica.

  • num_replicas – total number of replicas.

Raises

RuntimeError – If this module had already been initialized.

adaptdl.collective.teardown()[source]

Teardown this module, will block until this function has been invoked from all replicas.

Raises

RuntimeError – If this module has not been initialized.