`Introduction `__ \|\| **What is DDP** \|\| `Single-Node Multi-GPU Training `__ \|\| `Fault Tolerance `__ \|\| `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ \|\| `minGPT Training <../intermediate/ddp_series_minGPT.html>`__

What is Distributed Data Parallel (DDP)
=======================================

Authors: `Suraj Subramanian `__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn

      * How DDP works under the hood
      * What is ``DistributedSampler``
      * How gradients are synchronized across GPUs

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

      * Familiarity with `basic non-distributed training `__ in PyTorch

Follow along with the video below or on `youtube `__.

This tutorial is a gentle introduction to PyTorch `DistributedDataParallel `__ (DDP), which enables data parallel training in PyTorch. Data parallelism is a way to process multiple data batches across multiple devices simultaneously to achieve better performance. In PyTorch, the `DistributedSampler `__ ensures each device gets a non-overlapping input batch. The model is replicated on all the devices; each replica calculates gradients and simultaneously synchronizes with the others using the `ring all-reduce algorithm `__.

This `illustrative tutorial `__ provides a more in-depth Python view of the mechanics of DDP.
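The snippet below is a minimal sketch of how these pieces fit together in code. It is not part of this tutorial's accompanying scripts; it assumes a toy ``torch.nn.Linear`` model, random tensors as the dataset, the NCCL backend, and one process per GPU launched with ``torch.multiprocessing.spawn``. The point to notice is that ``DistributedSampler`` shards the data per rank, ``DistributedDataParallel`` replicates the model once per process, and the gradient all-reduce happens automatically inside ``loss.backward()``.

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler


    def demo(rank: int, world_size: int):
        # Every process joins the same process group; NCCL is the usual backend for GPUs.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)

        # Toy model; DDP broadcasts the weights from rank 0 so every replica starts identical.
        model = torch.nn.Linear(20, 1).to(rank)
        ddp_model = DDP(model, device_ids=[rank])

        # DistributedSampler hands each rank a non-overlapping slice of the dataset.
        dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
            for x, y in loader:
                x, y = x.to(rank), y.to(rank)
                optimizer.zero_grad()
                loss = loss_fn(ddp_model(x), y)
                loss.backward()   # gradients are all-reduced across ranks here
                optimizer.step()  # every replica applies the same averaged gradients

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(demo, args=(world_size,), nprocs=world_size)

Running this as a script on a machine with multiple GPUs spawns one training process per device; the next tutorial in this series walks through a complete single-node multi-GPU training example.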
Why you should prefer DDP over ``DataParallel`` (DP)
----------------------------------------------------

`DataParallel `__ is an older approach to data parallelism. DP is trivially simple (with just one extra line of code) but it is much less performant. DDP improves upon the architecture in a few ways:

+---------------------------------------+------------------------------+
| ``DataParallel``                      | ``DistributedDataParallel``  |
+=======================================+==============================+
| More overhead; model is replicated    | Model is replicated only     |
| and destroyed at each forward pass    | once                         |
+---------------------------------------+------------------------------+
| Only supports single-node parallelism | Supports scaling to multiple |
|                                       | machines                     |
+---------------------------------------+------------------------------+
| Slower; uses multithreading on a      | Faster (no GIL contention)   |
| single process and runs into Global   | because it uses              |
| Interpreter Lock (GIL) contention     | multiprocessing              |
+---------------------------------------+------------------------------+

Further Reading
---------------

-  `Multi-GPU training with DDP `__ (next tutorial in this series)
-  `DDP API `__
-  `DDP Internal Design `__
-  `DDP Mechanics Tutorial `__