Shortcuts

PyTorch Distributed Overview

Author: Will Constable

Note

edit View and edit this tutorial in github.

This is the overview page for the torch.distributed package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case.

Introduction

The PyTorch Distributed library includes a collective of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.

Parallelism APIs

These Parallelism Modules offer high-level functionality and compose with existing models:

Sharding primitives

DTensor and DeviceMesh are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.

  • DTensor represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.

  • DeviceMesh abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ProcessGroup instances for collective communications in multi-dimensional parallelisms. Try out our Device Mesh Recipe to learn more.

Communications APIs

The PyTorch distributed communication layer (C10D) offers both collective communication APIs (e.g., all_reduce

and all_gather) and P2P communication APIs (e.g., send and isend), which are used under the hood in all of the parallelism implementations. Writing Distributed Applications with PyTorch shows examples of using c10d communication APIs.

Launcher

torchrun is a widely-used launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs.

Applying Parallelism To Scale Your Model

Data Parallelism is a widely adopted single-program multiple-data training paradigm where the model is replicated on every process, every model replica computes local gradients for a different set of input data samples, gradients are averaged within the data-parallel communicator group before each optimizer step.

Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn’t fit in GPU, and can be combined together to form multi-dimensional (N-D) parallelism techniques.

When deciding what parallelism techniques to choose for your model, use these common guidelines:

  1. Use DistributedDataParallel (DDP), if your model fits in a single GPU but you want to easily scale up training using multiple GPUs.

  2. Use FullyShardedDataParallel (FSDP) when your model cannot fit on one GPU.

  3. Use Tensor Parallel (TP) and/or Pipeline Parallel (PP) if you reach scaling limitations with FSDP.

Note

Data-parallel training also works with Automatic Mixed Precision (AMP).

PyTorch Distributed Developers

If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide.

Docs

Access comprehensive developer documentation for PyTorch

View Docs

Tutorials

Get in-depth tutorials for beginners and advanced developers

View Tutorials

Resources

Find development resources and get your questions answered

View Resources