Distributed Data Parallel in PyTorch - Video Tutorials
Author: Suraj Subramanian
Follow along with the video below or on YouTube.
This series of video tutorials walks you through distributed training in PyTorch via DDP.
The series starts with a simple non-distributed training job, and ends with deploying a training job across several machines in a cluster. Along the way, you will also learn about torchrun for fault-tolerant distributed training.
The tutorial assumes a basic familiarity with model training in PyTorch.
Running the code
You will need multiple CUDA GPUs to run the tutorial code. Typically, this is done on a cloud instance with multiple GPUs (the tutorials use an Amazon EC2 P3 instance with 4 GPUs).
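Before starting, it can help to confirm how many GPUs your machine actually exposes. A minimal sketch using PyTorch's built-in device query:

```python
import torch

# Number of CUDA devices visible to this process (0 on a CPU-only machine).
n_gpus = torch.cuda.device_count()
print(f"Visible CUDA GPUs: {n_gpus}")

if n_gpus < 2:
    print("Note: the tutorial code expects at least 2 GPUs.")
```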
The tutorial code is hosted in this GitHub repo. Clone the repository and follow along!
Tutorial sections
- Introduction (this page)
- What is DDP? Gently introduces what DDP is doing under the hood
- Single-Node Multi-GPU Training: training models using multiple GPUs on a single machine
- Fault-tolerant distributed training: making your distributed training job robust with torchrun
- Multi-Node training: training models using multiple GPUs on multiple machines
- Training a GPT model with DDP: a "real-world" example of training a minGPT model with DDP
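As a preview of what the series builds toward, here is a rough sketch of a single-node DDP training script as launched by torchrun. The toy model, data, and hyperparameters are placeholders, not the tutorial's actual code (which lives in the linked repo):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; DDP wraps it to synchronize gradients across ranks.
    model = torch.nn.Linear(20, 1).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Placeholder dataset; DistributedSampler gives each rank a distinct shard.
    dataset = TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()


# Only run when launched under torchrun, which sets these variables.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

A script like this would be launched with, for example, `torchrun --nproc_per_node=4 train.py`: torchrun spawns one process per GPU and supplies the rank environment variables, which is what makes the fault-tolerance section of the series possible.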