adaptdl.torch.data module

class adaptdl.torch.data.AdaptiveDataLoader(dataset, batch_size=1, shuffle=False, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader, adaptdl.torch.data.AdaptiveDataLoaderMixin

This class is a PyTorch DataLoader that also supports adaptive batch sizes and checkpoint-restart elasticity. Applications can typically use objects of this class as direct replacements for PyTorch DataLoaders. However, some notable differences are:

  1. The batch_size argument defines the target total batch size across all replicas, rather than the local batch size on each replica.

  2. Custom sampler and batch_sampler are not supported.

  3. Iterating through the dataloader is only allowed from within an epoch loop (see adaptdl.torch.epoch), and only one dataloader loop is allowed at any given time.

Parameters
  • dataset (torch.utils.data.Dataset) – Dataset from which to load the data.

  • batch_size (int) – The target total batch size across all replicas. The actual total batch size may differ due to rounding (each replica must have the same local batch size) or due to scaling up when adaptive batch size is enabled.

  • shuffle (bool) – Whether the data is reshuffled at every epoch.

  • **kwargs – Keyword arguments passed to torch.utils.data.DataLoader.

Raises

ValueError – If sampler or batch_sampler are not None.

__iter__()[source]

Iterate over batches of data. When adaptive batch size is disabled, stops after the entire dataset has been processed once in total by all replicas. This means if there are K replicas, then this method will iterate over ~1/K of the dataset. When adaptive batch size is enabled, stops after making enough statistical progress roughly equivalent to one pass over the dataset with non-adaptive batch size. In this case, the dataset may be processed more than once.

A checkpoint-restart may be triggered in between batches. In this case, the current iteration state will be saved, restored after the restart, and iteration will continue where it left off.
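A minimal usage sketch follows. It assumes the usual AdaptDL entry points (init_process_group, AdaptiveDataParallel, and remaining_epochs_until exposed by adaptdl.torch); the model, optimizer, and dataset are hypothetical placeholders:

    import torch
    import adaptdl.torch as adl

    # Hypothetical model and optimizer; `dataset` is assumed to be any
    # torch.utils.data.Dataset defined elsewhere.
    model = torch.nn.Linear(28 * 28, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    adl.init_process_group("gloo")                      # "nccl" for GPU training
    model = adl.AdaptiveDataParallel(model, optimizer)

    # batch_size is the target *total* batch size across all replicas.
    dataloader = adl.AdaptiveDataLoader(dataset, batch_size=128, shuffle=True)

    # Iteration is only allowed inside an epoch loop (see adaptdl.torch.epoch).
    for epoch in adl.remaining_epochs_until(30):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()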

class adaptdl.torch.data.AdaptiveDataLoaderHelper(batch_size=1)[source]

Bases: object

This class provides fine-grained control over adaptive training loops. It can be used for building more user-friendly custom data loaders, such as AdaptiveDataLoader.

Parameters

batch_size (int) – The target total batch size across all replicas. The actual total batch size may differ due to rounding (each replica must have the same local batch size) or due to scaling up when adaptive batch size is enabled.

property accumulation_steps

The number of batches returned by the dataloader before a step is taken.

autoscale_batch_size(max_batch_size, local_bsz_bounds=None, gradient_accumulation=False)[source]

Enables adaptive batch size. Should be invoked once after the data loader object is created.

Parameters
  • max_batch_size (int) – Maximum total batch size allowed.

  • local_bsz_bounds (tuple) – A pair of (min_local_bsz, max_local_bsz), the min and max local batch sizes allowed on each replica.

Raises

ValueError – If any of the provided batch size bounds are invalid.
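For example, a sketch with hypothetical bounds (in practice this is usually invoked through an AdaptiveDataLoader, which exposes the same method):

    # Allow the total batch size to grow up to 1024, with each replica using
    # between 32 and 256 samples per step; gradient accumulation lets the
    # total batch size exceed what the replicas can process in a single step.
    dataloader.autoscale_batch_size(1024, local_bsz_bounds=(32, 256),
                                    gradient_accumulation=True)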

context()[source]

All iterators should be iterated under this context. It ensures proper cleanup of the elastic context at the end of each epoch.

property current_batch_size
property current_index

The total number of data samples processed so far in the current loop. Includes the data processed by all replicas. None if this data loader is not currently being iterated.

property current_local_bsz

The current logical local batch size used by the dataloader. The batch size returned by the dataloader may be smaller if gradient accumulation is used.

property end_index

(Optional) Can be used to track the end index of the dataset across restarts.

is_accum_step()[source]

Whether the current step’s gradient will be accumulated.

is_optim_step()[source]

Whether the optimizer step will be invoked in this step.

property local_bsz_bounds

The local batch size bounds on each replica. A pair of integers, (min_local_bsz, max_local_bsz).

property max_batch_size

The maximum total batch size allowed for adaptive batch size. None if adaptive batch size is disabled.

profile(commit)[source]

Every iteration of every epoch should be profiled under this context. Note that custom DataLoader writers should make sure it gets called an equal number of times on each replica.

Parameters

commit (bool) – Whether to commit the profiled results.

skipdone()[source]

Should be called just after entering the _elastic context to make sure that the dataloader loop is not replayed if it has already finished before a restart.
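A rough, hypothetical sketch of how these hooks might fit together in a custom loader; MyLoader and _produce_batches are illustrative names only, and the real AdaptiveDataLoader implementation differs in detail:

    from adaptdl.torch.data import AdaptiveDataLoaderHelper

    class MyLoader:
        def __init__(self, dataset, batch_size=32):
            self.dataset = dataset
            self._elastic = AdaptiveDataLoaderHelper(batch_size)

        def __iter__(self):
            with self._elastic.context():
                # Avoid replaying a loop that finished before a restart.
                self._elastic.skipdone()
                for batch in self._produce_batches():  # hypothetical batching logic
                    # Profile every iteration; commit results on optimizer steps.
                    with self._elastic.profile(self._elastic.is_optim_step()):
                        yield batch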

to_tensorboard(writer, global_step, tag_prefix='')[source]

Output some useful metrics to TensorBoard.

Parameters
  • writer (torch.utils.tensorboard.SummaryWriter) – SummaryWriter object to output metrics to.

  • global_step (int) – Global step value to record.

  • tag_prefix (str) – Prefix added to each metric’s tag.
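For example, a sketch with hypothetical writer, step counter, and tag prefix, shown here through an AdaptiveDataLoader, which exposes the same method:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir="/tmp/logs")  # hypothetical log directory
    global_step = 100                            # hypothetical step counter
    dataloader.to_tensorboard(writer, global_step, tag_prefix="AdaptiveDataLoader/")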

train()[source]

Set this data loader to be the one used for training. Only one data loader may be used for training.

property training
class adaptdl.torch.data.AdaptiveDataLoaderMixin(batch_size)[source]

Bases: object

This class provides elastic functionality to any custom DataLoader which inherits it. It defines a member _elastic of type AdaptiveDataLoaderHelper which has useful methods and members to implement restart-safe, elastic DataLoaders. It also exposes public methods which can be used inside training loops directly from AdaptiveDataLoader.

property accumulation_steps

The number of batches returned by the dataloader before a step is taken.

autoscale_batch_size(max_batch_size, local_bsz_bounds=None, gradient_accumulation=False)[source]
property current_batch_size
property current_local_bsz
to_tensorboard(writer, global_step, tag_prefix='')[source]

Output some useful metrics to TensorBoard.

Parameters
  • writer (torch.utils.tensorboard.SummaryWriter) – SummaryWriter object to output metrics to.

  • global_step (int) – Global step value to record.

  • tag_prefix (str) – Prefix added to each metric’s tag.

property training
class adaptdl.torch.data.ElasticSampler(dataset, shuffle=True)[source]

Bases: torch.utils.data.sampler.Sampler

A PyTorch Sampler which partitions data samples across multiple replicas, and supports deterministic continuation across checkpoint-restarts. Shuffling is deterministic for each epoch; ElasticSampler.set_epoch() should be invoked to obtain different orderings in different epochs.

Parameters
  • dataset (torch.utils.data.Dataset) – The dataset to sample from.

  • shuffle (bool) – Whether the data samples should be shuffled.

__iter__()[source]

Iterate through the samples in the dataset, in the order defined for a set epoch, starting at a set index. Produces only the indices for the local replica.

Returns: Iterator over data sample indices.

__len__()[source]

The total number of samples to be iterated through, starting at the set index, for the local replica.

Returns (int): Number of samples.

set_epoch(epoch, index=0)[source]

Set the epoch to derive samples from. Optional argument index can be specified to start sampling from a particular index, e.g. after a checkpoint-restart.

Parameters
  • epoch (int) – The epoch to sample from.

  • index (int) – The index to start sampling from.
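A minimal sketch of using ElasticSampler with a plain PyTorch DataLoader (hypothetical dataset and batch size; AdaptiveDataLoader does not accept custom samplers, so this is mainly useful when building custom loaders):

    from torch.utils.data import DataLoader
    from adaptdl.torch.data import ElasticSampler

    sampler = ElasticSampler(dataset, shuffle=True)   # `dataset` assumed to exist
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(10):
        # Derive a fresh shuffle order for this epoch, starting at index 0.
        sampler.set_epoch(epoch)
        for batch in loader:
            ...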

adaptdl.torch.data.current_dataloader()[source]

Reference to the data loader currently being iterated.

Returns (AdaptiveDataLoaderHelper): Current data loader.
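For example, code running inside a dataloader loop might query the active loader's helper (a sketch; the None check is an assumption about behavior outside a dataloader loop):

    from adaptdl.torch.data import current_dataloader

    helper = current_dataloader()
    if helper is not None:  # assumption: None when no loader is being iterated
        print(helper.current_batch_size, helper.current_index)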