LocalSGD¶
This module implements a fault tolerant version of LocalSGD and related methods.
- class torchft.local_sgd.LocalSGD(manager: Manager, model: Module, optimizer: Optimizer, sync_every: int, backup_device: Optional[device] = None, pin_memory: bool = True)[source]¶
Bases:
Module
LocalSGD is a model wrapper similar to DistributedDataParallel that implements the algorithm described in https://arxiv.org/pdf/1805.09767
This will synchronize the model parameters periodically in a fault tolerant way using a torchft Manager. The allreduce on the parameters will happen every sync_every steps after the optimizer.step call.
To make this safe and fault tolerant, a backup copy of the weights is required. By default these are stored in CPU memory. If any error occurs during the LocalSGD step, the step is discarded and the model parameters are reset back to the last time LocalSGD synchronized.
The backup weights could be eliminated by relaxing the guarantee of exactly sync_every steps, but that would diverge from the LocalSGD algorithm. DiLoCo also needs this backup copy to compute the delta.
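The backup-and-restore idea can be sketched in plain Python. This is a toy illustration of the rollback concept only (the class and method names here are hypothetical, not torchft's API; the real implementation copies tensors to the backup device):

```python
import copy


class RollbackParams:
    """Toy sketch: keep a copy of the parameters from the last
    successful synchronization and restore it if any step in the
    current window fails."""

    def __init__(self, params):
        self.params = params                   # "live" parameters
        self._backup = copy.deepcopy(params)   # snapshot at the sync point

    def sync(self):
        # After a successful allreduce, refresh the backup.
        self._backup = copy.deepcopy(self.params)

    def restore(self):
        # On error, roll back to the last synchronized state.
        self.params[:] = copy.deepcopy(self._backup)


p = RollbackParams([1.0, 2.0])
p.params[0] += 0.5   # a local optimizer step mutates the parameters
p.restore()          # an error occurred -> discard the window
print(p.params)      # back to [1.0, 2.0]
```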
The torchft quorum is computed at the beginning of every sync_every steps. If any error occurs, or a worker fails between syncs, the current sync_every steps will be discarded and a new quorum will be computed on the next step.
If running in async mode, on a joining worker the first sync_every steps will be discarded, as the model will be recovering during that period. When using sync mode, the checkpoint will be restored prior to the first step.
TODO: add a way via Manager to detect workers failing early for shrink only
TODO: add DiLoCo support
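The schedule described above can be illustrated with a toy step loop. This only mirrors *when* the quorum and allreduce happen relative to sync_every; it is not torchft's real implementation:

```python
# Toy loop showing the sync_every window: the quorum starts at the
# beginning of each window, and the parameter allreduce happens at
# the end of it (after the optimizer step on that iteration).
sync_every = 4
events = []

for step in range(1, 9):
    if (step - 1) % sync_every == 0:
        events.append(f"step {step}: start quorum")          # window begins
    # ... forward / backward / optimizer.step() happen here ...
    if step % sync_every == 0:
        events.append(f"step {step}: allreduce parameters")  # window ends

print(events)
```

If an error were raised anywhere inside a window, the whole window's steps would be discarded and the parameters restored from the backup, as described above.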
- forward(*args: object, **kwargs: object) → object [source]¶
Run the wrapped model's forward pass.
This should be called before the optimizer step.
This will start the quorum and save the parameters if this is the first step.
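The required call order (forward before the optimizer step) can be sketched with mock objects. The class below is a stand-in, not the torchft API; it only records the ordering the docs describe:

```python
# Mock sketch of the expected per-step call order when training with
# a LocalSGD-style wrapper: forward() first (which starts the quorum
# and saves parameters on the first step), then optimizer.step().
class MockLocalSGD:
    def __init__(self):
        self.calls = []

    def forward(self, batch):
        self.calls.append("forward")         # would start quorum / save params
        return batch

    def optimizer_step(self):
        self.calls.append("optimizer.step")  # allreduce fires every sync_every steps


m = MockLocalSGD()
for _ in range(2):
    m.forward("batch")   # forward must precede the optimizer step
    m.optimizer_step()

print(m.calls)
```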