get_distributed_backend

torchtune.training.get_distributed_backend(device_type: str, offload_ops_to_cpu: bool = False) → str

Gets the PyTorch Distributed backend based on device type.

Parameters:
  • device_type (str) – Device type to get backend for.

  • offload_ops_to_cpu (bool, optional) – Whether any operations will be offloaded to CPU. Examples of such operations are CPU offload for FSDP and asynchronous save for distributed checkpointing. Defaults to False.

Example

>>> get_distributed_backend("cuda")
'nccl'
>>> get_distributed_backend("cpu")
'gloo'
>>> get_distributed_backend("cuda", offload_ops_to_cpu=True)
'cuda:nccl,cpu:gloo'
Returns:

Distributed backend for use in torch.distributed.init_process_group.

Return type:

str
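
The returned string can be passed directly to torch.distributed.init_process_group. Below is a minimal, hypothetical usage sketch (not taken from the torchtune docs); it assumes the process-group environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are set by a launcher such as torchrun.

import torch
import torch.distributed as dist
from torchtune.training import get_distributed_backend

# Pick a device type and resolve the matching backend string.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
backend = get_distributed_backend(device_type, offload_ops_to_cpu=True)

# init_process_group accepts either a single backend name ("nccl") or a
# device-to-backend mapping string such as "cuda:nccl,cpu:gloo".
dist.init_process_group(backend=backend)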
